The thesis on Sheriffing.
The purpose of sheriffing is threefold:
The expectations on a sheriff are as follows:
At the beginning of your stint as Sheriff, please perform the following tasks:
What should I do as I prepare to end my shift?
At the end of your stint as Sheriff, please perform the following tasks:
How do I read the CI user interface?
How do I find out about build failures?
How do I deal with build failures?
When Sheriffs encounter build failures on the public Chromium OS buildbot, they should follow this process:
What bugs do I file, and how do I assign them?
Ahh, the tree is green. I can go back to my work, right?
Wrong! When the tree is green, it's a great time to start investigating and fixing all the niggling things that make it hard to sheriff the tree.
What should I do if I see a commit-queue run that I know is doomed?
If there is a commit queue run that is known to be doomed due to a bad CL and you know which CL is bad, you can manually blame the bad CL to spare the innocent CLs from being rejected. Go to the Gerrit review page of the bad CL and set Verified to -1. CQ will recognize the gesture and reject only the bad CL(s).
If the commit queue run has encountered infrastructure flake and is doomed, most of the time CQ will not reject any CLs (with chromite CLs being the exception).
If the commit queue run will fail due to a builder-specific problem, such as infrastructure failures, specific builders can be temporarily marked as experimental by adding "EXPERIMENTAL=something-paladin" to the tree status description. Also include a link to the bug describing the problem. See the Tree Sheriffs documentation for more detail and examples.
In other cases, where it is not necessary to wait for the full results of the run, you can save developer time and hassle by aborting the current CQ run.
How do I deal with a broken Chrome?
If a bug in Chrome does not get caught by the Chrome PFQ, you should first engage with the Chrome gardener. They are responsible for helping to find and fix or revert the Chrome change that caused the problem.
If the Chrome bug is serious enough to be causing failures on the canaries or the CQ, you should temporarily pin Chrome back to the last working version with this procedure:
How can I revert a commit?
If you've found a commit that broke the build, you can revert it using these steps:
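As a rough sketch of that workflow (the project path and branch name below are assumptions; adapt them to the repository that contains the bad commit), the revert can be prepared locally and uploaded to Gerrit for review:
$ cd ~/chromiumos/src/path/to/project    # repo that contains the bad commit (assumed path)
$ repo start revert-bad-cl .             # create a local branch for the revert
$ git revert <SHA-of-bad-commit>         # creates a new commit that undoes the change
$ repo upload .                          # send the revert to Gerrit for review
For simple cases, the "Revert" button on the original change in Gerrit accomplishes the same thing.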
Help with specific failure categories
How do I investigate VMTest failures?
There are several common reasons why the VMTests fail. First pull up the stdio link for the VMTest stage, and then check for each of the possibilities below.
Autotest test failed
Once you've got the VMTest stage's stdio output loaded, search for 'Total PASS'. This will get you to the Autotest test summary. You'll see something like
Total PASS: 29/33 (87%)
Assuming the number is less than 100%, there was a failure in one of the autotests. Scroll backwards from 'Total PASS' to identify the specific test (or tests) that failed. You'll see something like this:
In this case, Chrome failed to log in for one of three reasons: 1) it could not find the network, 2) it could not get online, or 3) it could not show the WebUI login prompt. Look at the Chrome log in /var/log/chrome/chrome, or find someone who works on UI.
(If you're annoyed by the long string before the test name, please consider working on crbug.com/313971, when you're gardening.)
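If you've downloaded the stage log to a local file, a quick way to jump to the summary and the individual failures is to search around the 'Total PASS' line (the file name stdio.log below is just an assumption about where you saved the log):
$ grep -B 30 'Total PASS' stdio.log    # show the lines leading up to the summary
$ grep 'FAIL' stdio.log                # list the individual failing tests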
Crash detected
Sometimes, all the tests will pass, but one or more processes crashed during the test. Not all crashes are failures, as some tests are intended to test the crash system. However, if a problematic crash is detected, the VMTest stdio output will have something like this:
Crashes detected during testing:
----------------------------------------------------------
chrome sig 11
  login_CryptohomeMounted
ASAN error detected
The x86-generic-asan and amd64-generic-asan builders instrument some programs (e.g. Chrome) with code to detect memory access errors. When an error is detected, ASAN prints error information and terminates the process. As with crashes, it is possible for all tests to pass even though a process terminated.
If Chrome triggers an ASAN error report, you'll see the message "Asan crash occurred. See asan_logs in Artifacts". As suggested in that message, you should download "asan_logs". See the next section, "How to find test results/crash reports" for details on how to download those logs.
Note: in addition to Chrome, several system daemons (e.g. shill) are built with ASAN instrumentation. However, we don't yet bubble up those errors in the test report. See crbug.com/314678 if you're interested in fixing that.
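For example, if you'd rather fetch the logs from the command line than through the Artifacts link, gsutil works; the bucket path below is an assumption, so copy the real one from the Artifacts page:
$ gsutil ls gs://chromeos-image-archive/<builder>/<version>/                  # list the artifacts for the run
$ gsutil cp 'gs://chromeos-image-archive/<builder>/<version>/asan_logs*' .    # download the ASAN logs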
ssh failed
The test framework needs to log in to the VM in order to do things like execute tests and download log files. Sometimes this fails. In these cases, we have no logs to work from, so we need the VM disk image instead.
You'll know that you're in this case if you see messages like this:
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ...
When this happens, look in the build report for "vm_disk" and "vm_image" links. These should be right after the "stdio" link. For example, if you're looking at the build report for "lumpy nightly chrome PFQ Build #3977":
Download the disk and memory images, and then resume the VM using kvm on your workstation.
$ tar --use-compress-program=pbzip2 -xf \
$ tar --use-compress-program=pbzip2 -xf \
$ cros_start_vm \
You should now have a VM which has resumed at exactly the point where the test framework determined that it could not connect. (Note that, at this time, we don't have an easy way to mount the VM filesystem without booting it. If you're interested in improving that, please see crbug.com/313484.) For more information about troubleshooting VMs, see how to run Chrome OS image under VMs.
How to find test results/crash reports?
The complete results from VMTest runs are available on Google Storage, by clicking the [ Artifacts ] link in-line on the waterfall display in the report section:
How do I extract stack traces manually?
Normally, you should never need to extract stack traces manually, because they will be included in the Artifacts link, as described above. However, if you need to, here's how:
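A minimal sketch using the standard breakpad tool, assuming you've already downloaded test_results.tgz from the Artifacts link and have a matching symbols directory (both paths below are assumptions):
$ tar -xzf test_results.tgz                      # unpack the test results
$ find . -name '*.dmp'                           # locate the breakpad minidumps
$ minidump_stackwalk path/to/crash.dmp /path/to/symbols > stack.txt 2> /dev/null
$ less stack.txt                                 # the symbolized stack trace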
If you successfully retrieve a stack trace, search for issues with a similar call stack and add the Google Storage link, or file a new issue.
Note that in addition to breakpad dmp files, the test_results.tgz also has raw Linux core files. These can be loaded into gdb and can often produce better stack traces than minidump_stackwalk (e.g. expanding all inlined frames).
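For example (the binary and core file paths are assumptions, and for boards whose architecture differs from your workstation you may need the matching cross-gdb):
$ gdb /path/to/the/binary path/to/core           # load the raw core file
(gdb) bt                                         # print the backtrace, including inlined frames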
A buildbot slave appears to be down (purple). What do I do?
Probably nothing. Most of the time, when a slave is purple, that just indicates that it is restarting. Try waiting a few minutes and it should go green on its own. If the slave doesn't restart on its own, contact the Deputy and ask them to fix the bot.
platform_ToolchainOptions autotest is failing. What do I do?
This test searches through all ELF binaries on the image and identifies binaries that have not been compiled with the correct hardened flags.
To find out what test is failing and how, look at the *.DEBUG log in your autotest directory. Do a grep -A10 FAILED *.DEBUG. You will find something like this:
05/08 09:23:33 DEBUG|platform_T:0083| Test Executable Stack 2 failures, 1 in whitelist, 1 in filtered, 0 new passes
FAILED: /opt/google/chrome/pepper/libnetflixplugin2.so
05/08 09:23:33 ERROR|platform_T:0250| Test Executable Stack 1 failures
FAILED: /path/to/binary
This means that the test called "Executable Stack" reported 2 failures, there is one entry in the whitelist of this test, and after filtering the failures through the whitelist, one file still fails. The name of that file is /path/to/binary.
The "new passes" indicate files that are in the whitelist but passed this time. To find the owner who wrote this test, do a git blame on this file: https://chromium.googlesource.com/chromiumos/third_party/autotest/+blame/master/client/site_tests/platform_ToolchainOptions/platform_ToolchainOptions.py and grep for the test name ("Executable Stack" in this case).
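If you have a local autotest checkout, the same lookup works from the command line (the checkout path below is an assumption):
$ cd ~/chromiumos/src/third_party/autotest/files
$ git blame client/site_tests/platform_ToolchainOptions/platform_ToolchainOptions.py | grep 'Executable Stack'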
Find the change that added the new binary that fails the test, or changed compiler options for a package such that the test now fails, and revert it. File an issue against the author with the failure log, and CC the owner of the test (found by git blame above).
Who should I contact regarding ARC++ issues?
Visit go/arc++docs and see the Contact Information section.
How are sheriff rotations scheduled? How do I make changes?
Please see go/ChromeOS-Sheriff-Schedule for all details on scheduling, shift swaps, etc.
What do I do when I see a NotEnoughDutsError?
When you see an error like:
NotEnoughDutsError: Not enough DUTs for board: <board>, pool: <pool>; required: 4, found: 3, suite: au, build: <build>
contact the on-duty Deputy to balance the pools. Sheriffs are responsible for ensuring that there aren't bad changes that continue to take out DUTs; Deputies are responsible for DUT allocation.
Many boards are in red, but I couldn't find an exact failure reason, just "_ShutdownException". What happened?
It's possible the waterfall was just restarted. Currently, the waterfall is restarted every Wednesday at 11:00am; you can confirm this with the Deputy. This is expected.
Tips
You can set up specific settings in the buildbot frontend so that you are watching what you want and having it refresh as often as you want. Navigate to the buildbot frontend you are watching and click on "customize" in the upper left corner. From there you can select which builds to watch and how long to wait between refreshes.
If looking at the logs in your browser is painful, you can open the log's URL with vim; vim will fetch it and put you in a vim session to view it.
Other handy links to information: