
Sheriff FAQ: Chromium OS

The thesis on Sheriffing.

The purpose of sheriffing is threefold:

  1. Make sure build blocking failures are identified and addressed in a timely fashion.

  2. Manually watch over our build system in ways automation doesn’t/can’t do.

  3. Give developers a chance to learn a little more about how the build works and breaks.

The expectations on a sheriff are as follows:

  1. Do not leave the tree unattended; coordinate with your co-sheriffs to ensure this. Normal work hours (for you) only.

  2. Address tree closure/throttling.

    1. Promptly update the tree status to show you are investigating.

    2. File or locate bugs. The tree should not close unless something broke. File new bugs, or update existing ones. Include links to the failed build. Raise priority as needed.

    3. Find owners for bugs that break your builds. It’s not the Sheriff’s job to fix things, only to find the right person. Build deputy/Lab Sheriff (listed on the waterfall) can help with this.

    4. Reopen when ready. If the next run can reasonably be expected to pass, reopen.

  3. Watch for build (especially CQ) Failures

    1. Mark CLs that fail a CQ run with Verified -1. If done before the CQ run finishes, this prevents innocent CLs from being blamed.

    2. File bugs for flake that fails a build. This is how we get build/infrastructure/test flake fixed, so it stops happening. Identify existing bugs where you can, or file new ones. Some bugs are auto filed, this should be called out in waterfalls or logs.

    3. Identify confusing output. If something about the waterfall or build results don’t make sense, file a bug. BE VERY SPECIFIC. This feedback needs to be actionable.

  4. While the tree is open, work on gardening tasks -- not your normal work items.


You can find the calendars here:

What should I do when I report for duty?

At the beginning of your stint as Sheriff, please perform the following tasks:

  1. Sign on to irc.freenode.net / #chromium-os, and introduce yourself as an on-duty sheriff.
  2. Pull up the public and internal Chrome OS buildbot waterfalls.  If the status is out of date, update it.
  3. Triage any buildbot failure messages in your Inbox.
  4. Read the email from yesterday’s Sheriffs and/or look at the Chromium OS sheriff log, and familiarize yourself with the TreeCloser issues they cite.

What should I do as I prepare to end my shift?

At the end of your stint as Sheriff, please perform the following tasks:
  1. Ensure that TreeCloser issues have been filed for any ongoing failures (like a flaky Chrome SEGV that is under investigation) or infrastructure problems (git hangs during sync, say).
  2. Update the Chromium OS sheriff log and Email chromium-os-dev@ with the list of issues and a blurb describing anything else of note -- perhaps you just reverted a bad change, and expect a bot to cycle green.

How do I read the waterfall?

  • What do the different tree statuses (open, throttled, closed) mean and do?

    • Open - ToT is believed to be healthy, with builds that can make it through the commit queue and canary builders without failing. Commit queue will run unencumbered.

    • Throttled - Typically the automatic result of a builder failure. May mean the tree is in a bad state, but is often caused by flake. The CQ stops testing new changes. An already-running CQ run will be allowed to submit changes, if it passes. Further runs of the CQ will only occur if there are patches marked as CQ+2 (normally by you).

    • Closed - ToT has been manually closed. Typically done to avoid adding to an already catastrophic failure or outage. Commit queue runs will not submit changes.

  • What is the Commit Queue?

    • Please read the Commit Queue overview.

    • The CQ Master decides success/failure for a given run. You mostly shouldn’t look at slaves except to see failure details. The Master should link to appropriate slaves.

  • What is a Canary Build?

    • An official build (with a unique version) is prepared and validated by canary builders three times a day.

    • The Canary Master decides success/failure for a given run. You mostly shouldn’t look at slaves, except to see failure details. The Master should link to appropriate slaves.

  • What is the Pre-CQ?

    • The Pre-CQ launches tryjobs to validate CLs before they enter the CQ, but doesn’t do anything other than launch them on standard tryjob servers.

  • What is a PFQ builder?

    • Please read the Pre Flight Queue FAQ. Contact the Chrome Gardener (listed on waterfall) if it stays red.

  • What is an ASAN bot?

    • The x86-generic-asan and amd64-generic-asan builders build images with AddressSanitizer instrumentation to catch memory access errors; see the "ASAN error detected" section below.

  • A toolchain buildbot is failing. What do I do?

    • These buildbots test the next version of the toolchain and are being sheriff'd by the toolchain team. Not your problem.

How do I find out about build failures?

  1. Watch the tree status on the waterfall.

  2. Watch the waterfalls (internal and external)

  3. Warning emails. As Sheriff, you’ll get automated emails about various failures. You should always follow up on these emails, and file bugs or close the tree as appropriate.

How do I deal with build failures?

When Sheriffs encounter build failures on the public Chromium OS buildbot, they should follow this process:

  1. Update the Tree Status Page to say you're working on the problem.

    • Example: Tree is throttled (build_packages failure on arm -> johnsheriff)

  2. See if you can fix it by reverting a recent patch

    • If the build or test failure has a likely culprit, contact the author.  If you can’t, revert!

    • Infrastructure failure (repo sync hang, archive build failure, etc)?  Contact a Trooper!

  3. Update the Tree Status Page with details on what is broken and who is working on it.

    • Example: Tree is throttled (libcros compile error -> jackcommitter, http://crosbug.com/1234).

  4. Make sure the issue is fixed.

    • If the build-breaker is taking more than 5-10 minutes to land a fix, ask them to revert.

    • If the build-breaker isn’t responding, perform the revert yourself.

  5. Watch the next build to make sure it completes cleanly.

    • Buildbots only send email when their status changes from green to red and the tree is open. So, if a buildbot is red, it won't send email about failures until it completes successfully at least once.

    • Sheriffs are responsible for watching buildbots and making sure that people are working on making them green.

    • If there's any red on the dashboard, the Sheriffs should be watching it closely to make sure that the issues are being fixed.  If there isn't, the Sheriffs should be working to improve the sheriffability of the tree.

If you still need help determining what happened, talk to the build deputy or email chromeos-build@.

What bugs do I file, and how do I assign them?

  • If a test fails, or a specific component fails to compile, file a bug against that component normally. Assign it to an owner for that component.

  • If build infrastructure fails, file a bug with go/cros-build-bug and assign it to the current deputy.

  • If there is an HW Lab infrastructure failure (NOT a test failure), file a bug with go/cros-lab-bug and assign it to the lab sheriff.

Ahh, the tree is green.  I can go back to my work, right?

Wrong!  When the tree is green, it’s a great time to start investigating and fixing all the niggling things that make it hard to sheriff the tree.

  • Is there some red-herring of an error message?  Grep the source tree to find out where it’s coming from and make it go away.

  • Some nice-to-have piece of info to surface in the UI?  Talk to build deputy and figure out how to make that happen.

  • Some test that you wish was written and run in suite:smoke?  Go write it!

  • Has the tree been red a lot today?  Get started on your postmortem!

  • Still looking for stuff to do?  Try flipping through the Gardening Tasks.  Feel free to hack on and take any of them!

How can I pick a specific set of changes to go through a commit queue run?

If the tree is broken, and you have a change or set of changes that you believe should fix it, it may be desirable to put just that change or set of changes through the commit queue. This can be accomplished by:

  • Set the tree to Throttled.

  • Set the Commit-Queue value of the desired CLs to +2 (along with the usual CR+2 and V+1). This will allow the CLs to be picked up by the CQ even when the tree is throttled.

If the CLs pass the commit queue, they will be committed to the tree, and all the usual side-effects of a commit queue run will take place (such as ebuild revbumping and prebuilt generation).

What should I do if I see a commit-queue run that I know is doomed?

If there is a commit queue run that is known to be doomed due to a bad CL and you know which CL is bad, you can manually blame the bad CL to spare the innocent CLs from being rejected. Go to the Gerrit review page of the bad CL and set Verified to -1. The CQ will recognize the gesture and reject only the bad CL(s).


If the commit queue run has encountered infrastructure flake and is doomed, most of the time CQ will not reject any CLs (with chromite CLs being the exception).


In other cases where it is not necessary to wait for the full results of the run, you can save developer time and hassle by aborting the current CQ run.

How can I revert a commit?

If you've found a commit that broke the build, you can revert it using these steps:

  1. Find the Gerrit link for the change in question

    • Gerrit lists recently merged changes. View the change in question.

    • In some cases (like the PFQ), you can see links to the CL that made it into the build straight from the waterfall.

  2. Click the “Revert Change” button and fill in the dialog box.

    • The dialog box includes some stock info, but you should add on to it.
      e.g. append a sentence such as: This broke the tree (xxx package on xxx bot).

  3. You are not done.  This has created an open change for you; go find and submit it.

    • Add the author of the change as a reviewer of the revert, but push without an LGTM from the reviewer (just approve yourself and push).

Help with specific failure categories

How do I investigate VMTest failures?

There are several common reasons why the VMTests fail. First pull up the stdio link for the VMTest stage, and then check for each of the possibilities below.

Autotest test failed

Once you've got the VMTest stage's stdio output loaded, search for 'Total PASS'. This will get you to the Autotest test summary. You'll see something like
Total PASS: 29/33 (87%)
Assuming the number is less than 100%, there was a failure in one of the autotests. Scroll backwards from 'Total PASS' to identify the specific test (or tests) that failed. You'll see something like this:
/tmp/cbuildbotXXXXXX/test_harness/all/SimpleTestUpdateAndVerify/<...>/login_CryptohomeMounted            [  FAILED  ]
/tmp/cbuildbotXXXXXX/test_harness/all/SimpleTestUpdateAndVerify/<...>/login_CryptohomeMounted              FAIL: Unhandled JSONInterfaceError: Automation call {'username': 'performancetestaccount@gmail.com', 'password': 'perfsmurf', 'command': 'Login'} received empty response.  Perhaps the browser crashed.
In this case Chrome failed to log in for one of three reasons: 1) it could not find the network, 2) it could not get online, or 3) it could not show the webui login prompt. Look for the Chrome log in /var/log/chrome/chrome, or find someone who works on UI.
(If you're annoyed by the long string before the test name, please consider working on crbug.com/313971, when you're gardening.)
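
If you prefer to grep from a terminal, here's a minimal sketch, assuming you've saved the VMTest stage's stdio output to a local file (vmtest_stdio.log is just an illustrative name):

$ grep 'Total PASS' vmtest_stdio.log
$ grep -E 'FAILED|FAIL:' vmtest_stdio.log    # list failing tests and their one-line reasons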

Crash detected

Sometimes, all the tests will pass, but one or more processes crashed during the test. Not all crashes are failures, as some tests are intended to test the crash system. However, if a problematic crash is detected, the VMTest stdio output will have something like this:
Crashes detected during testing:
----------------------------------------------------------
chrome sig 11
  login_CryptohomeMounted
If there is a crash, proceed to the next section, "How to find test results/crash reports".

ASAN error detected

The x86-generic-asan and amd64-generic-asan builders instrument some programs (e.g. Chrome) with code to detect memory access errors. When an error is detected, ASAN prints error information, and terminates the process. Similarly to crashes, it is possible for all tests to pass even though a process terminated.

If Chrome triggers an ASAN error report, you'll see the message "Asan crash occurred. See asan_logs in Artifacts". As suggested in that message, you should download "asan_logs". See the next section, "How to find test results/crash reports" for details on how to download those logs.

Note: in addition to Chrome, several system daemons (e.g. shill) are built with ASAN instrumentation. However, we don't yet bubble up those errors in the test report. See crbug.com/314678 if you're interested in fixing that.

ssh failed

The test framework needs to log in to the VM in order to do things like execute tests and download log files. Sometimes, this fails. In these cases, we have no logs to work from, so we need the VM disk image instead.

You'll know that you're in this case if you see messages like this:

Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
When this happens, look in the build report for "vm_disk" and "vm_image" links. These should be right after the "stdio" link. For example, if you're looking at the build report for "lumpy nightly chrome PFQ Build #3977":


Download the disk and memory images, and then resume the VM using kvm on your workstation.

$ tar --use-compress-program=pbzip2 -xf \
    failed_SimpleTestUpdateAndVerify_1_update_chromiumos_qemu_disk.bin.8Fet3d.tar

$ tar --use-compress-program=pbzip2 -xf \
    failed_SimpleTestUpdateAndVerify_1_update_chromiumos_qemu_mem.bin.TgS3dn.tar

$ cros_start_vm \
    --image_path=chromiumos_qemu_disk.bin.8Fet3d \
    --mem_path=chromiumos_qemu_mem.bin.TgS3dn

You should now have a VM which has resumed at exactly the point where the test framework determined that it could not connect.
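
If you want to poke at the resumed VM directly, you can try sshing into it yourself. This is a hedged sketch: use whatever local ssh port cros_start_vm reports (9222 is a common default), and test images typically accept root with the shared test key or the password test0000.

$ ssh -p 9222 root@localhost
localhost ~ # tail /var/log/messages    # see what the VM was doing when the test framework gave up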

Note that, at this time, we don't have an easy way to mount the VM filesystem without booting it. (If you're interested in improving that, please see crbug.com/313484.)

How to find test results/crash reports?

The complete results from VMTest runs are available on Google Storage, by clicking the [ Artifacts ] link inline on the waterfall display in the report section:

From there, you should see a file named chrome.*.dmp.txt that contains the crash log. Example


If you see a stack trace here, search for issues with a similar call stack and add the google storage link, or file a new issue.

How do I extract stack traces manually?

Normally, you should never need to extract stack traces manually, because they will be included in the Artifacts link, as described above. However, if you need to, here's how:
  1. Download and extract the test_results.tgz file from the artifact (above), and find the breakpad .dmp file.
  2. Find the build associated with your crash and download the file debug.tgz
    1. Generally the debug.tgz in the artifacts should be sufficient
    2. For official builds, see go/chromeos-images
    3. TODO: examples of how to find this for cautotest and trybot(?) failures
  3. Untar (tar xzf) this in a directory under the chroot, e.g. ~/chromeos/src/scripts/debug
  4. From inside the chroot, run the following: minidump_stackwalk [filename].dmp debug/breakpad > stack.txt 2>/dev/null
  5. stack.txt should now contain a call stack!
If you successfully retrieve a stack trace, search for issues with a similar call stack and add the google storage link, or file a new issue.
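
As a condensed, hedged sketch of the steps above (archive names and the .dmp path are illustrative; adjust them to match your artifacts):

$ mkdir -p ~/chromeos/src/scripts/debug
$ tar xzf test_results.tgz
$ tar xzf debug.tgz -C ~/chromeos/src/scripts/debug
$ cd ~/chromeos/src/scripts && cros_sdk    # enter the chroot, then from inside it:
(cr) $ minidump_stackwalk path/to/the/crash.dmp debug/breakpad > stack.txt 2>/dev/null
(cr) $ less stack.txt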

Note that in addition to breakpad dmp files, the test_results.tgz also has raw Linux core files. These can be loaded into gdb and can often produce better stack traces than minidump_stackwalk (e.g. by expanding all inlined frames).
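
For example, a hedged sketch of loading a raw core in gdb (paths are illustrative; for non-x86 boards you'll want the matching cross-gdb from the chroot, and pointing gdb at the unstripped binary from debug.tgz gives better symbols):

$ tar xzf test_results.tgz
$ gdb /path/to/the/binary /path/to/the/core
(gdb) bt full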

A buildbot slave appears to be down (purple). What do I do?

Probably nothing. Most of the time, when a slave is purple, that just indicates that it is restarting. Try waiting a few minutes; it should go green on its own. If the slave doesn't restart on its own, contact the Deputy and ask them to fix the bot.

platform_ToolchainOptions autotest is failing. What do I do?

This test searches through all ELF binaries on the image and identifies binaries that have not been compiled with the correct hardened flags.

To find out what test is failing and how, look at the *.DEBUG log in your autotest directory. Do a grep -A10 FAILED *.DEBUG. You will find something like this:


05/08 09:23:33 DEBUG|platform_T:0083| Test Executable Stack 2 failures, 1 in whitelist, 1 in filtered, 0 new passes FAILED:
/opt/google/chrome/pepper/libnetflixplugin2.so
05/08 09:23:33 ERROR|platform_T:0250| Test Executable Stack 1 failures
FAILED:
/path/to/binary

This means that the test called "Executable Stack" reported 2 failures, there is one entry in the whitelist for this test, and after filtering the failures through the whitelist, one failure remains. The name of the failing file is /path/to/binary.


The "new passes" indicate files that are in the whitelist but passed this time.


To find the owner who wrote this test, do a git blame on this file: http://git.chromium.org/gitweb/?p=chromiumos/third_party/autotest.git;a=blob;f=client/site_tests/platform_ToolchainOptions/platform_ToolchainOptions.py;h=c1ab0c275a5995c2ad62eb9dd8ba677b5d10e5a2;hb=HEAD and grep for the test name ("Executable Stack" in this case).
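
For example, a hedged sketch from a local autotest checkout (the path assumes the usual chromiumos layout under src/third_party/autotest/files):

$ cd ~/chromeos/src/third_party/autotest/files
$ grep -n "Executable Stack" client/site_tests/platform_ToolchainOptions/platform_ToolchainOptions.py
$ git blame client/site_tests/platform_ToolchainOptions/platform_ToolchainOptions.py | grep "Executable Stack"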


Find the change that added the new binary that fails the test, or changed compiler options for a package such that the test now fails, and revert it.  File an issue on the author with the failure log, and CC the owner of the test (found by git blame above).

Tips and Tricks

Install helpful extensions.

You can set up specific settings in the buildbot frontend so that you are watching what you want and having it refresh as often as you want.


Navigate to the buildbot frontend you are watching and click on customize, in the upper left corner. From there you can select what builds to watch and how long between refreshes you want to wait.


If looking at the logs in your browser is painful, you can open the log's URL directly in vim; vim will fetch it and put you in a vim session to view it.
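
For example (the URL is illustrative; paste the actual stdio link from the waterfall):

$ vim 'http://build.chromium.org/p/chromiumos/builders/<builder>/builds/<number>/steps/VMTest/logs/stdio'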