This page has details to help Chromium sheriffs. For more information and procedures, see Tree Sheriffs and read the theses on Sheriffing.
For passing the torch, you can also leave notes here.
How to prepare
Before you start sheriffing, make sure you have working build environments on all your machines. If you don't have hardware for some platforms, try borrowing it. You may need it to repro breaks locally. Know how to sync to a specific revision (in case you need to bisect to find the CL that introduced the break).
What to watch
- Failures-only waterfall. It will show you only the bots a sheriff would need to look at.
(A builder is considered failing if the last finished build was not successful, a step in the current build(s) failed, or if the builder is offline.)
- Console view to make sure we are not too far behind in testing.
- Some sheriffs don't look at the waterfall at all; instead they open this console and choose [merge] at the bottom.
- LKGR status. Make sure it moves forward relatively often, as other trees depend on it.
- IRC #chromium on freenode.
- Be available on IM.
- Do not ignore the Reliability tester. It's very important for Chromium stability.
- Do not ignore ChromeOS bots. These bots build and run Chrome for ChromeOS on Linux and ChromiumOS respectively and are as important as win/mac/linux bots. If you're not sure how to fix an issue, feel free to contact ChromiumOS sheriffs.
- Do not ignore ASan bots. This is called "Memory waterfall" but is nevertheless required to be watched by the regular sheriffs. Bugs reported by ASan usually cause memory corruptions in the wild, so do not hesitate to revert or disable the failing test (ASan does not support suppressions).
- It's up to the main sheriffs to keep an eye on the Official waterfall.
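The rule the failures-only waterfall uses to decide that a builder is failing can be written as a small predicate. This is only a sketch: `BuilderState` and its fields are hypothetical names made up for illustration, not part of any buildbot API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BuilderState:
    """Hypothetical snapshot of one builder (illustrative only)."""
    last_finished_success: bool
    current_failed_steps: List[str] = field(default_factory=list)
    offline: bool = False


def is_failing(builder: BuilderState) -> bool:
    """A builder counts as failing if any of the three conditions holds."""
    return (
        not builder.last_finished_success      # last finished build was not successful
        or bool(builder.current_failed_steps)  # a step in a current build failed
        or builder.offline                     # the builder is offline
    )
```

Note that a builder whose last finished build was green still counts as failing if a step in the build currently running has already failed.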
When to close the tree
All bots that you need to watch are "tree closers". If any of them fails, the tree will be automatically closed by the gatekeeper (it will become red, and you'll see a "tree closed" message from trungl-bot in IRC). Then, you need to act:
- (Check for flakiness first!) A test occasionally goes red: Tree open
- This is a flaky test. If the change that made it flaky is obvious, revert the change.
- If the change isn't obvious (or the test is new), keep the tree open but disable the test and file/update a bug. See below for details.
- A test went red: Tree maybe closed
- If the cause is obvious (the FooShouldWork test broke, and someone just checked in changes to foo_utils.cc), the tree can stay open. Revert the change and send the review to the person who made it.
- If the cause isn't obvious, close the tree. Ask everyone on the blamelist to help track it down and revert the patch as soon as found.
- webkit_tests went red or got new regressions: Tree maybe closed
- Layout tests are just like other kinds of tests, except that sometimes we file and mark their new failures rather than fixing them right away. See below for details.
- One category of bot fails to build or has a swarm of test failures: Tree closed
- If all the debug, release, Vista, XP, etc. builds go red, act as with a single test going red.
- One bot went red: Tree open
- If only one buildbot is having problems (can't update, can't compile, exploding in some other way), the tree can stay open while it's fixed. We have reasonable redundant coverage now. Ask a trooper for help.
- An update failed: Tree maybe closed
- Try again from the internal waterfall. Ask a colleague for the URL. If it keeps failing or gives a worrisome error, contact a trooper.
- "extract build" is orange, or fails once: Tree open
- Orange "extract build" means it's using the latest built revision and not the one it's supposed to. If it does not work the second time, contact a trooper.
- A slave is hung at a step: Tree maybe closed
- If a slave hangs, sometimes just cancelling the build may not work. In that case call a trooper.
- Small insects crawling on stems and leaves seem to be eating sap: Tree infested
- The tree probably has aphids. Release ladybugs nearby to eat them.
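The buildbot-related cases in the list above amount to a lookup from symptom to tree state. A minimal sketch follows; the symptom keys and the `triage` helper are invented names for illustration, and the only "maybe closed" case it resolves is a single red test, where the list says an obvious cause keeps the tree open.

```python
from enum import Enum


class TreeState(Enum):
    OPEN = "open"
    MAYBE_CLOSED = "maybe closed"
    CLOSED = "closed"


# Symptom keys are hypothetical labels for the cases listed above.
TRIAGE = {
    "flaky_test": TreeState.OPEN,
    "test_red": TreeState.MAYBE_CLOSED,
    "webkit_tests_red": TreeState.MAYBE_CLOSED,
    "bot_category_red": TreeState.CLOSED,
    "single_bot_red": TreeState.OPEN,
    "update_failed": TreeState.MAYBE_CLOSED,
    "extract_build_orange": TreeState.OPEN,
    "slave_hung": TreeState.MAYBE_CLOSED,
}


def triage(symptom: str, cause_obvious: bool = False) -> TreeState:
    """Return the suggested tree state for a symptom."""
    if symptom == "test_red":
        # Obvious cause: revert and keep the tree open. Otherwise close it.
        return TreeState.OPEN if cause_obvious else TreeState.CLOSED
    return TRIAGE[symptom]
```

The remaining "maybe closed" cases are left as they are because they depend on a sheriff's judgment or a trooper's answer.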
Warning: before opening the tree for one failure, make sure that all the bots are green. If the tree is already closed, subsequent failures don't "reclose" it, so they are easy to miss.
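As a sketch of that check, assuming you have some mapping from bot name to its current status (the mapping and the status strings here are hypothetical, not an actual buildbot data format):

```python
from typing import Dict, List


def bots_blocking_reopen(statuses: Dict[str, str]) -> List[str]:
    """Return the names of all bots that are not green, sorted by name."""
    return sorted(name for name, status in statuses.items() if status != "green")


def safe_to_reopen(statuses: Dict[str, str]) -> bool:
    """The tree should only be reopened when no bot is blocking."""
    return not bots_blocking_reopen(statuses)
```

Listing every non-green bot, rather than only the one that closed the tree, is what catches the failures that arrived while the tree was already closed.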
Closing vs. throttling: if the reason for closing the tree is isolated and well-understood, you may choose to throttle it instead of closing. This will allow authors with urgent changes to land their CL after asking your permission.
- "Revert Patchset" button in Rietveld:
The button creates a Rietveld issue with the patchset inverted, adds keywords to submit it immediately via the CQ (NOTREECHECKS, NOTRY, TBR), and automatically checks the CQ checkbox. Design document is here.
If the button does not work because the patch cannot be applied or the CQ is down, then use drover (below).
- Drover:
$ cd $TMP_DIR; drover --revert 12345
- git:
$ git checkout trunk; git pull; gclient sync
$ git svn find-rev r12345 # -> a git hash
$ git checkout -b revert_foo trunk
- gcl/svn:
$ cd $SRC # a gcl/svn repo
- The "Revert Patchset" button updates the original CL saying it is being reverted. If you use Drover, git or gcl/svn then please manually update the original CL. The author of the original CL must be notified that his/her CL has been reverted.
- Drover and "Revert Patchset" button in Rietveld do not work on files larger than ~900KB. If you need to revert a patch that modifies generated_resources.grd, for example, then use git or gcl/svn.
- Waiting for a fix is not a good idea. Just revert until it compiles again.
- If it's not clear why compile failed, contact a trooper.
- See Handling a failing test.
Checking whether a test is flaky
A practical way to check whether a test is flaky is to find a link like this in the automated error email: http://build.chromium.org/p/chromium.memory/builders/Mac%20ASAN%2064%20Builder/builds/917. Click it to get to the build page. At the top, click the builder link, select 200 at the bottom, and see whether the test fails from time to time.
See also http://chromium-build-logs.appspot.com/
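If you have copied the test's results from those recent builds into a list, the judgment you are making by eye looks roughly like the sketch below. The function name and the pass/fail list are assumptions for illustration; nothing here reads from the build pages.

```python
from typing import Sequence


def classify_test(history: Sequence[bool]) -> str:
    """Classify a test from its recent results, oldest first (True = passed)."""
    results = list(history)
    if not results:
        return "unknown"
    if all(results):
        return "passing"
    first_failure = results.index(False)
    # If every build since the first failure also failed, this looks like a
    # real break rather than flakiness.
    if not any(results[first_failure:]):
        return "broken"
    # Failures mixed with later passes: the test goes red only occasionally.
    return "flaky"
```

A test that failed only in the most recent build is reported as "broken" here, since a single trailing failure cannot yet be told apart from a new regression.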
Handling failing perf expectations (like the sizes step)
When a step turns red because perf expectations weren't met, follow the instructions on the perf sheriffs page to handle it. It can also help to get in touch with the developer who landed the change, along with the current perf sheriff, to decide how to proceed. For sizes, the stdio output of the sizes step lists the static initializers found; diffing it against the output of a green run can find the culprit of that kind of sizes failure.
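The diff against a green run can be done with any diff tool. As a rough sketch, assuming you have saved the static initializer lines from both runs and that each initializer sits on its own line (the exact stdio format is not assumed here), the new entries are the lines present in the red run but absent from the green one:

```python
from typing import Iterable, List


def new_static_initializers(green_lines: Iterable[str],
                            red_lines: Iterable[str]) -> List[str]:
    """Return entries that appear in the red run but not in the green run."""
    seen = {line.strip() for line in green_lines if line.strip()}
    added = []
    for line in red_lines:
        entry = line.strip()
        if entry and entry not in seen:
            added.append(entry)  # keep the order in which they appear
            seen.add(entry)      # report each new entry only once
    return added
```

Whatever shows up in the result is a candidate to check against the blamelist of the first red build.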
Coordinating WebKit breakages / fixes
Tips and Tricks
How to read the tree status at the top of the waterfall
The memory sheriff helps with tending the Memory FYI tree, and the webkit sheriff helps out with the Webkit bots.
- Chromium / Webkit / Modules rows contain all the bots on the main waterfall.
- Official and Memory bots are on separate waterfalls, but the view at the top shows their status.
Merging the console view
If you want to know when revisions have been tested together, open the console view and click the "merge" link at the bottom.
- Open a GChat session with your fellow sheriffs. This is useful for coordinating outside of IRC. (e.g. lunch breaks, who will pursue what, etc)
- Open a shared GDoc and use it to track open issues. For example, if a test starts flaking, drop in the dashboard links. Take notes about your discoveries, CLs, crbugs, owners, etc. If anything outlasts your shift, put it in the Sheriff Log.
NOTE: If your shift spans a weekend, you aren't expected to sheriff on the weekend (you do have to sheriff on the other days, e.g. Friday and Monday). The same applies for holidays in your office.