
Bisecting Performance Regressions

Summary

A Python script for automating the process of syncing, building, and testing commits for performance regressions.

The script works by syncing to both revisions specified on the command line, building them, and running performance tests to get baseline values for the bisection. After that, it will perform a binary search on the range of commits between the known good and bad revisions. If the pinpointed revision turns out to be a change in Blink, V8, or Skia, the script will attempt to gather a range of revisions for that repository and continue bisecting.
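For intuition, the core loop is an ordinary binary search over the commit range, classifying each tested midpoint as good or bad by comparing its metric against the two baselines. The sketch below is purely illustrative (the function names are hypothetical, and the real tools/bisect-perf-regression.py also handles building, retries, and DEPS rolls into Blink/V8/Skia):

def bisect(revisions, good_value, bad_value, build_and_test):
    # revisions[0] is the known good revision, revisions[-1] the known bad one.
    lo, hi = 0, len(revisions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        value = build_and_test(revisions[mid])  # sync, build, run the perf test
        # A revision "passes" if its metric is closer to the good baseline.
        if abs(value - good_value) < abs(value - bad_value):
            lo = mid   # still good; the regression landed later
        else:
            hi = mid   # already bad; the regression landed earlier
    return revisions[hi]  # first bad revision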

Note that for performance_ui_tests, testing will be faster if you give a specific test, such as ShutdownTest.SimpleUserQuit, rather than the overall suite, such as ShutdownTest.*.

The same performance bisect try server can also be used to try performance tests with particular patches.

Supported Platforms

 Platform           Builder Name
 Linux              linux_perf_bisect
 Windows            win_perf_bisect
 Mac                mac_perf_bisect
 Android GN         android_gn_perf_bisect
 Android Nexus 4    android_nexus4_perf_bisect
 Android Nexus 10   android_nexus10_perf_bisect
 ChromeOS           In Development

Run on Trybot

The performance try server is tryserver.chromium.perf.

  • Create new git branch or check out existing branch.
  • Edit tools/run-bisect-perf-regression.cfg (instructions in file; a sample config is sketched after this list).
    • Take care to strip any src/ directories from the head of relative pathnames.
    • You can search through the stdio from performance tests for "run_benchmark" or "run_measurement" to help figure out the command to use.
    • Please note that you cannot specify a blink/v8/skia revision range directly. You can only specify a range of chrome revisions, and the tool will extract other revision ranges by looking at the .DEPS.git file.
    • Also note that cycle times for the bots can be vastly different. If it's possible to reproduce the regression on multiple platforms, keep in mind that linux bisects are usually the fastest, followed by windows, then mac, and finally android.
      • You can expect to wait about 2-3 hours for a linux bisect, and nearly 8 hours for an android bisect (this obviously depends greatly on the performance test you're running, and the size of the bisection range).
      • A good rule of thumb is that linux is the fastest, mac and windows will take 2-3x longer than linux, and android can take 3-4x longer
  • Commit your changes locally; if this is your second or later change on the branch, amend via git commit --amend (otherwise trychange.py can't determine the diff).

  • Run the try job like so (or substitute in win_perf_bisect, mac_perf_bisect, android_perf_bisect, android_nexus4_perf_bisect, android_nexus10_perf_bisect for the --bot parameter to run on Windows, Mac, Android Galaxy Nexus, Android Nexus 4, or Android Nexus 10, respectively):

  • git cl upload --bypass-hooks
  • git cl try --email=<email>@chromium.org -m tryserver.chromium.perf --bot=linux_perf_bisect
  • You can also see the results on the buildbot waterfall: bisect_bots.

  • The trybot will send an email on completion as a regular "Try Success" email, showing whether the bisect was successful and linking to the output (the Results stdio link at the bottom); there may also be some spurious emails.

    Also note that the trybots run on LKGR, not ToT. If you just made a change to the bisect script itself, make sure you pass -r REV to ensure the bisect run includes your revision.


  • If the bot seems to be down or having issues, please ping a trooper.
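For reference, a filled-in tools/run-bisect-perf-regression.cfg looks roughly like the sketch below. The authoritative field list is in the file itself; the values and the exact set of keys here are illustrative only:

config = {
    'command': 'tools/perf/run_measurement blink_perf third_party/WebKit/PerformanceTests/Parser/html-parser.html',
    'good_revision': '179776',
    'bad_revision': '179782',
    'metric': 'total/html-parser',
    'repeat_count': '20',
    'max_time_minutes': '20',
}

Note that the command path has the leading src/ stripped, as described above.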


Run Locally

You probably do not want to run locally, except to debug settings before sending to a try bot or when running overnight: the tests run in your local session and make your computer effectively unusable, and anything else you run will interfere with the tests.
  • Recommended that you set power management features to "performance" mode.
    • For googlers:
sudo goobuntu-config set powermanagement performance
  • Run locally in a private checkout (the recommended way of running locally); first change to the Chromium source directory, e.g. CHROMIUM_DIR="$CHROMIUM_ROOT/src":
cd "$CHROMIUM_DIR"
    • Edit tools/run-bisect-perf-regression.cfg (instructions in file)
tools/run-bisect-perf-regression.py -w .. -p "$GOMA_DIR"
      • -w Working directory to store private copy of depot
      • -p Path to goma's directory (optional).
  • Run locally from <chromium>/src:

  • tools/bisect-perf-regression.py -c "out/Release/performance_ui_tests --gtest_filter=PageCyclerTest.MozFile" -g 179776 -b 179782 -m times/t
    tools/bisect-perf-regression.py -c "out/Release/performance_ui_tests --gtest_filter=ShutdownTest.SimpleUserQuit" -g 1f6e67861535121c5c819c16a666f2436c207e7b -b b732f23b4f81c382db0b23b9035f3dadc7d925bb -m shutdown/simple-user-quit

Tips

  • Often you can get a clearer regression by looking at the other metrics in the same test. For example, if the regression is in warm_times/page_load_time for a page_cycler test, look at the individual pages; often there's a page where the regression clearly stands out that you can bisect on.
  • With tests that suddenly become noisy, bisecting on changes in the mean isn't all that useful. There's a "bisect_mode" parameter in the config that allows you to specify "std_dev", and the bisect script will bisect on changes in standard deviation instead (see the config snippet after this list). There is currently no way to do this from the dashboard, so you'll have to initiate the bisect manually.
  • You can use the bisect script to find functional breakages as well. Specify "return_code" for the "bisect_mode". You can leave "metric" empty since this won't be used. There is currently no way to do this from the dashboard, so you'll have to initiate the bisect manually.
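As a rough illustration of the two tips above, the relevant config lines might look like this (everything other than the bisect_mode values is a placeholder):

# Bisect on changes in noise rather than in the mean:
'metric': 'warm_times/page_load_time',
'bisect_mode': 'std_dev',

# Bisect on a functional breakage (the test starts failing):
'metric': '',
'bisect_mode': 'return_code',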

Gotchas

  • The bisect bot uses git under the hood, which means that if you suspect a Blink/V8/Skia roll, be sure that the range you specify includes when the .DEPS.git file was submitted.
    • The script will attempt to detect this situation and expand the range to include the .DEPS.git roll.
  • Remove src/ from paths in the command.
  • Remove the --browser_executable flag and --output-trace-tag=_ref.
  • Use --browser=release (a before/after example appears after this list).
  • Running all blink_perf tests takes too long; the easiest way to run just one test is to pass the full path of that test.
    To run all tests:
    $ tools/perf/run_measurement blink_perf third_party/WebKit/PerformanceTests

    To run one test:
    $ tools/perf/run_measurement blink_perf third_party/WebKit/PerformanceTests/Parser/html-parser.html
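Putting the three command-related points above together: a command copied from a builder's stdio usually needs trimming before it goes into the config. An illustrative before/after (the flags and paths here are placeholders, not taken from a real log):

# As copied from stdio (will not work in the config):
#   src/tools/perf/run_measurement --browser_executable=out/Release/chrome --output-trace-tag=_ref blink_perf src/third_party/WebKit/PerformanceTests
# After trimming:
'command': 'tools/perf/run_measurement --browser=release blink_perf third_party/WebKit/PerformanceTests',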

Interpreting Output

Assuming the bisect succeeds, the culprit revision will be the First Bad Revision (see below), identified by its SHA-1 hash.
Remember that you can get the SVN revision number by running git log with this hash, as in:
git log -1 1e73bb01af4255521f957d3f2a603d179001226b

Note that this script does not distinguish regressions from progressions (it just checks for a change and does not check the sign), so it can just as easily be used to check a fix.
If you have a suspect revision (say a proposed fix), call it N: simply set the good revision to N − 1 and the bad revision to N and run it; if there is a change, the script will identify it.
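For example (the revision numbers here are made up), to check whether r180000 changed the metric, you would run something like:

tools/bisect-perf-regression.py -c "out/Release/performance_ui_tests --gtest_filter=ShutdownTest.SimpleUserQuit" -g 179999 -b 180000 -m shutdown/simple-user-quit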

Sample Output

Full results of bisection:
  chromium  1e73bb01af4255521f957d3f2a603d179001226b  0
  chromium  d8bc3ecdc9f5f95de468d3b24ec59b124694ae05  1
  chromium  78d298ce9a459a9f2b48b4d216b54ba3223d3e9c  1
  chromium  b80a2872e75c47b641b6a396ec1290707c1a416d  ?
  chromium  aa38a7c91d7322464a568d19a5ae27d7546702ad  ?
  chromium  257eafb91d1453baef7d929d5ea2fbd61b827d0a  1
  chromium  7f1f63fdadb4383fdaad7cab63e46bcc5210322e  ?
  chromium  663fe04ad401c63e5ae2b93b625b54be666b8002  ?
  chromium  9105a1e099a228599685ee769e82384128464e0e  ?
  chromium  e57644c05e1ba4ceab5dfece6fb12566a10302aa  ?
  chromium  edf48d44a71f3c49a0d0d5e7574c340e843a7791  ?
  chromium  a5a4e99ea504c2ee4f85b7d9cda746fcfcd3bf84  1
Tested commits:
  chromium  1e73bb01af4255521f957d3f2a603d179001226b  {'total: t': 138.24999999999997}
  chromium  d8bc3ecdc9f5f95de468d3b24ec59b124694ae05  {'total: t': 129.29375}
  chromium  78d298ce9a459a9f2b48b4d216b54ba3223d3e9c  {'total: t': 128.85}
  chromium  257eafb91d1453baef7d929d5ea2fbd61b827d0a  {'total: t': 128.74375}
  chromium  a5a4e99ea504c2ee4f85b7d9cda746fcfcd3bf84  {'total: t': 128.78125}

Results: Regression may have occurred in range:
  -> First Bad Revision: [1e73bb01af4255521f957d3f2a603d179001226b] [chromium]
  -> Last Good Revision: [d8bc3ecdc9f5f95de468d3b24ec59b124694ae05] [chromium]

Commit  : 1e73bb01af4255521f957d3f2a603d179001226b
Author  : <author>
Email   : <email>
Date    : <date>
Subject : <subject>
  • ? - Build was skipped
  • F - Build failed
  • 0 - Build was deemed "bad"
  • 1 - Build was deemed "good"
  • 'chromium' is the repository

Posting Results

In addition to using the results yourself, you usually want to share them on the Issue Tracker when blaming a change and contacting its author, either by filing a new bug or by commenting on an existing one.

Ideally readers will be able to understand the output and how to reproduce the test themselves, notably to verify that it has been fixed. To that end, an ideal report will include:
  • Summary of results, namely the revision numbers (bad revision − 1, bad revision), and times (average & standard deviation): the key data from the above output (with SHA-1 replaced by Subversion revision numbers, for clarity).
  • The name of the test suite and metric, and exact command line to reproduce, as in:
Command to reproduce is:
out/Release/performance_ui_tests --gtest_filter=StartupTest.ProfilingScript1
metric is:
warm/profiling_scripts1-first
  • A diff so readers can run the try job themselves; this can be produced via:
cd "$CHROMIUM_DIR" && git diff > ../bisect-perf.diff
  • Command line to use, for reference
git try --bot=linux_perf_bisect --diff=../bisect-perf.diff --email=<email>@chromium.org
  • Link to the try bot results
  • The exact output, as above
When posting to verify that a regression has been fixed, it's useful to show the summary of times before/at the regression, and then before/at the fix, to verify that times changed, then changed back.

Technical Details

Assumptions

  • The script has no real logic to decide whether the <bad_revision> specified on the command line actually represents a performance regression. It simply gathers reference values for both the <good_revision> and <bad_revision> and bisects based on those: after each revision is tested, if the average value of the metric in question is closer to the reference value for the <good_revision>, the revision passes; otherwise it fails.
    • In the special case of the times/t metric, the script will check for and attempt to parse out a list of pages used by the test (e.g. in PageCyclerTest). Pass/fail in that case is based on how many pages pass or fail (see the sketch after this list).
  • You are running the script from <chromium>/src.
  • You have 'master' checked out on src.
  • You are on the LKGR (may not be HEAD)
  • (You may be able to instead use a .diff file, as above.)
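A minimal sketch of the per-page special case described in the first bullet (the majority threshold here is an assumption, not taken from the script):

def page_passes(value, good_ref, bad_ref):
    # Same closer-to-good comparison as for a single metric, applied per page.
    return abs(value - good_ref) < abs(value - bad_ref)

def revision_passes(page_values, good_refs, bad_refs):
    passed = sum(page_passes(page_values[p], good_refs[p], bad_refs[p])
                 for p in page_values)
    return passed > len(page_values) / 2  # assumed: a majority of pages still look "good"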

Known Issues

  • Git workflow only
  • Only shows Git SHA-1 hashes, not Subversion revision numbers

Future

  • Look into t-tests for pass/fail logic
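For illustration only, a t-test-based pass/fail decision might look like the sketch below (this is not implemented anywhere; scipy is used purely as an example, and the significance threshold is arbitrary):

from scipy import stats

def looks_good(samples, good_samples, alpha=0.05):
    # Welch's t-test: can we distinguish the tested revision's samples
    # from the known-good baseline at significance level alpha?
    _, p_value = stats.ttest_ind(samples, good_samples, equal_var=False)
    return p_value > alpha  # statistically indistinguishable from good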

