
Perf Sheriffing

This page has details to help Chromium perf sheriffs. For more information, see Tree Sheriffs and read the theses on Sheriffing.


Note that Chromium perf sheriffs are now also responsible for Android perf sheriffing. See https://sites.google.com/a/google.com/clank/engineering/android-perf-sheriffs



Goal

Keep performance regressions from making it out into the wild!  In particular, triage all of the Chromium Perf alerts and keep the Chromium Perf waterfall green.


The Chromium Perf waterfall runs many different tests against the latest Chrome to measure Chrome’s performance.  This data is automatically analyzed by the perf dashboard, which reports anomalies via email to the current perf sheriff and perfbot-gasper@, as well as on the dashboard itself.  These anomalies likely correspond to performance regressions.

You'll be the last line of defense in upholding Chrome's core principle of speed.


"There is usually a way to add functionality without hurting page load time, startup time or other key metrics.  One just has to be a bit more clever at times." -Darin Fisher

What to Watch

If this is your first time sheriffing, please watch the perf sheriff overview video. It's almost an hour long, but has everything on this page and much more!


It is also recommended to hang out in #chromium on irc.freenode.net during your shift. Most chromium developers are there, so it can be useful for pinging people about issues.


Keep the perf sheriff tracking doc up to date with all known issues.

What to Do

Keep bots green

  • Purple bot? Purple probably indicates bot problems. However, bots also reboot between cycles, so it may be normal. If there are several recent purple builds or the bot stays purple for more than 10 minutes, ping chrome-troopers at google dot com and file a bug with the Infra-Labs label.
  • Red or Orange bot? Red indicates first failure. Orange indicates repeat failure.
    • Scan the build history for the first failure. Click it and analyze the CLs in that build for what could have caused the failure.
    • File a P0 bug with the Performance and Type-Bug-Regression labels. Make sure it contains a link to the failing step build log.
    • If you are able to identify the culprit change with reasonable certainty, add the author to the bug and in the meantime politely revert the CL directly with drover. No need to wait for the author to confirm in the case of a breakage.
    • If you are not able to identify the culprit, try to reproduce locally and/or CC possible culprit authors on the bug. Consider also reaching out to chrome-perf-sheriffs at google dot com and/or tonyg.
    • If the failure appears to be flake, file a P2 bug with the Performance and Type-Bug-Regression labels. Feel free to investigate only if you have time.

By the end of your shift, try to leave the next sheriff with a green tree!

Triage dashboard alerts


  • At the start of your shift, it's a good idea to look over what bugs were reported recently. 

  • If you're not signed in, click "sign in" in the top right corner. Use your google.com account, since internal-only data is visible only when you're signed in with it.

  • Visit the alerts dashboard page and make sure that the "Chromium Perf Sheriff" rotation is selected; now you can start to triage groups of related alerts.

    • You can sort the alerts table by clicking on the table column headers.
    • Check one or more related alerts, and click the "Graph" button.
    • Examine the resulting graphs to see whether the alerts appear to be caused by the same root regression, and close out the graphs that are not part of the same issue.
    • Look in the list of alerts to see whether there are any for which a bug has already been filed; if so, associate the new alerts with the existing bug. Otherwise, file a new bug.
  • Remember, the perf dashboard is not as smart as you -- in some cases you may need to mark alerts as invalid, or nudge them. Watch out for the following patterns:
    • If the alert looks like noise, mark the alert "Invalid" and don't file a bug. If it looks like the noise level sharply increased recently, you may file a bug to track the noise increase. If you see many invalid alerts on the same graph, you may file a bug to have the noisy test disabled, and/or update the anomaly threshold settings to be less sensitive.
    • If the reference trace moved by the same amount as the target trace, something likely changed in the test or the bot, not in Chrome. Mark the alert "Invalid". (A back-of-the-envelope sketch of these two checks appears after this list.)
    • If the alert is not placed on the first bad data point, use the "Nudge" option to move the alert into that position. For example, if the alert is placed on the last good data point, it needs a Nudge of +1 to the right. Remember, the Bisect button and bug title will only use the correct revision range if you nudge the alert into the right spot first.
    • If the alert is a revert of a very recent improvement, choose the "Ignore Alert" option. You have bigger fish to fry.

    • If the regression has already recovered at tip of tree, choose the "Ignore Alert" option. No sense in spending time on it.

  • After going through the above, you will be left with alerts that may be for real regressions.
  • It is worth spending considerable effort at this point to group alerts that appear to have the same underlying cause. Look for similar sets of tests that regressed at the same revision range and associate them to the same bug.
  • If the alert is a real regression, file a bug and track down the right owner (see detailed instructions under "Diagnose regressions" below).
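
For intuition, here is a back-of-the-envelope version of the "looks like noise" and "reference trace moved too" checks above. It is only an illustrative sketch: the function names and thresholds are assumptions, not the perf dashboard's actual anomaly-detection logic.

    # Illustrative sketch only -- NOT the perf dashboard's real logic.
    # Function names and thresholds are assumptions for illustration.
    import statistics

    def looks_like_noise(before, after, sigmas=2.0):
        """A step between two windows of data points is probably noise if
        it is small relative to the scatter already in the 'before' window."""
        step = abs(statistics.mean(after) - statistics.mean(before))
        return step < sigmas * statistics.stdev(before)

    def moved_with_reference(target_step, ref_step, rel_tol=0.25):
        """If the reference trace moved by roughly the same amount as the
        target trace, suspect a test/bot change rather than Chrome."""
        return abs(target_step - ref_step) <= rel_tol * abs(target_step)
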
By the end of your shift, try to leave the next sheriff with an empty alerts page!

Diagnose regressions

As perf sheriff, you are responsible for following through with the regression bugs filed during your shift. Assign yourself as the owner until you find a more appropriate owner. Be sure the bugs have the labels "Performance" and "Type-Bug-Regression" so they show up in our weekly triage meetings. To further prevent severe regressions from slipping through the cracks, consider applying milestone and release block labels to the bug.
  1. Use the "Bisect" button on the dashboard to kick off a bisect job to pinpoint the culprit change. The dashboard will run the bisect and update the bug with the results when complete.
  2. When the results are posted to the bug, verify that they make sense. The magnitude of the regression detected by the bisect should roughly match that on the dashboard, and the confidence should be high (a rough sketch of this sanity check follows this list). If not, change something about the bisect and try again. For example: more iterations, a wider revision range, a different platform, or a more specific test (e.g. one page of the page_cycler or one suite of dromaeo).
  3. Another
  4. Once you find the culprit change, assign the bug to the author and ask them to revert. Chrome has a no-regression policy as specified in our core principle of speed. Because there are sometimes tradeoffs or other considerations involved, it is usually best to let the author do this rather than doing it yourself. If the author claims the regression is expected but you have any doubts, feel free to loop in rschoen/tonyg or reach out to chrome-perf-sheriffs for more support.
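
As a rule of thumb for the sanity check in step 2, something like the following comparison is what you're doing mentally. This is an illustrative sketch; the 50% slack and 95% confidence cutoff are assumptions, not official thresholds.

    # Illustrative sketch of the step-2 sanity check; the slack and
    # confidence cutoffs below are assumptions, not official values.
    def bisect_result_plausible(dashboard_delta, bisect_delta,
                                bisect_confidence, rel_slack=0.5,
                                min_confidence=95.0):
        """True if the regression size found by the bisect is in the same
        ballpark as the dashboard's, and the bisect itself is confident."""
        magnitudes_agree = (abs(bisect_delta - dashboard_delta)
                            <= rel_slack * abs(dashboard_delta))
        return magnitudes_agree and bisect_confidence >= min_confidence
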
Handle sizes regressions

  • If sizes fails and the jump is > 100K or static initializer fails, talk to the build sheriffs in #chromium about identifying the culprit and reverting.
  • If sizes fails and the jump is < 100K, update the expectations.

  1. cd src/tools/perf_expectations [alternatively: svn co svn://svn.chromium.org/chrome/trunk/src/tools/perf_expectations/]
  2. Edit perf_expectations.json, a file that contains lines that look like the following:

    "linux-release-64/sizes/nacl_helper-text/text": {"reva": 222892, "revb": 222899, "type": "absolute", "better": "lower", "improve": 1893929, "regress": 2093291, "sha1": "217d6c9a"},

    Update the reva and revb values by hand; the improve, regress, and sha1 values will be updated for you by make_expectations.py.
  3. Find the line corresponding to the new failure.  For example, if you're looking at a failure on the perf dashboard in linux-release-64, the sizes test, and a failing expectation for 'chrome/chrome', the line you need to update will have the key 'linux-release-64/sizes/chrome/chrome'.

  4. Select a new reva and revb. This should be a range of commits (x-axis values) that exhibit the new value of the metric; regardless of whether the metric rose or fell, you want to include two points on the new plateau. It is good to include at least 50 revisions in your range; this allows make_expectations.py to get a sense of how noisy the metric is.

  5. For that test result, update reva and revb to match the new range and save your changes to the file. When make_expectations.py computes values for regress and improve, it will use a tolerance (5% by default) around the actual dashboard y-axis values over the interval you've specified (see the sketch after these steps for the idea). A handful of metrics use a non-default tolerance. Typically there is no need to touch any of the other fields; make_expectations.py will do that for you.

  6. Run 'make_expectations.py'; this updates perf_expectations.json with new result values and a new checksum for that line.

  7. Open a new bug (example) with the label Performance and cc all authors of CLs in the regression revision range.  Assign the bug to the first or most likely author in the list and ask them all to verify whether their change could have introduced the regression.

  8. Upload a CL (example) with your change.   Like the example, include in the commit message the key that is being updated along with a working persistent link to the change on the graph.  Make sure the presubmit tests run, since these tests check for common syntax errors.  Send your CL for review to the current perf sheriffs (check the calendar if you're unsure). Include at the bottom of the description:

    BUG=<your new bug>
  9. Publish your CL on codereview.chromium.org so that it can be reviewed.

  10. If the tree is open, land your CL then notify the sheriff(s).

  11. If the tree is closed, check with the sheriffs if you are free to land your change.  Let them review it if they want, in addition to the TBR.
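
To make the tolerance computation from step 5 concrete: in the example JSON line above, the improve and regress values sit roughly 5% below and above a plateau value near 1993610 (1993610 × 0.95 ≈ 1893929 and 1993610 × 1.05 ≈ 2093291). The sketch below shows that idea; it is NOT make_expectations.py's actual code, and the function name, signature, and exact formula are assumptions.

    # Rough sketch of the idea behind the improve/regress computation.
    # NOT make_expectations.py's actual code; the name, signature, and
    # exact formula are assumptions for illustration.
    def expectation_bounds(values, better="lower", tolerance=0.05):
        """Given the metric's values over the reva..revb range, compute
        (improve, regress) thresholds with the given tolerance."""
        lo, hi = min(values), max(values)
        if better == "lower":
            # Lower is better: dropping below the plateau is an
            # improvement; rising above it is a regression.
            return lo * (1 - tolerance), hi * (1 + tolerance)
        # Higher is better: the bounds flip.
        return hi * (1 + tolerance), lo * (1 - tolerance)

    # expectation_bounds([1993610]) -> roughly (1893929.5, 2093290.5),
    # i.e. the improve/regress values in the example line, up to rounding.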

Join Us

If you'd like to help keep the Chromium Perf waterfall green, send email to rschoen@chromium.org. The rotation lasts 3 days, once per quarter.

Calendar

You can view the current rotation by adding this calendar: google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com.  Instructions on swapping shifts are here.
