
Sheriff FAQ: Chromium OS

The thesis on Sheriffing.

The purpose of sheriffing is threefold:

  1. Make sure build-blocking failures are identified and addressed in a timely fashion.

  2. Manually watch over our build system in ways automation doesn’t or can’t.

  3. Give developers a chance to learn a little more about how the build works and breaks.

The expectations on a sheriff are as follows:

  1. Do not leave the tree unattended; coordinate with your co-sheriffs to ensure this. You are only expected to cover your own normal working hours.

  2. Annotate build failures in the Build Annotator

    1. Keep this user interface open during your entire shift. Refresh it periodically to look for new CQ failures. (Hint: do not click the Update List button; that will force latest_build_id in the URL and you won't see new build results.)

    2. Promptly update the master-paladin build outcome notes to show you are investigating.

    3. File or locate bugs. Update an existing bug where one applies; otherwise file a new one. You can file a bug from the CI UI: on a failed build, click the "File a Bug" link at the top. Include links to the bugs in the failed build annotation. A bad CL doesn't need a bug unless it slips past the CQ and breaks ToT. Raise priority as needed.

    4. Record the root cause of a failure in the Annotator. In most cases, set the "Blame URL" to a crbug.com or crrev.com URL. IMPORTANT: check the "Finalize annotations?" checkbox, or the Exonerator won't work!

    5. Find owners for bugs that break your builds. It’s not the Sheriff’s job to fix things, only to find the right person.

    6. Do not skip the annotations above: without them, the CL Exonerator won't work! Every time you mark the root cause of a build failure, you potentially save a hundred engineers from having to CQ+1 their CLs again.

  3. Watch for build (especially CQ) failures

    1. Mark CLs that cause a CQ run to fail with Verified -1. If done before the CQ run finishes, this prevents innocent CLs from being blamed.

    2. File bugs for flake that fails a build. This is how build, infrastructure, and test flake gets fixed so that it stops happening. Identify existing bugs where you can, or file new ones. Some bugs are filed automatically; this should be called out in the waterfall or logs. See go/cros-infra-bug.

    3. Identify confusing output. If something about the waterfall or build results doesn’t make sense, file a bug. BE VERY SPECIFIC. This feedback needs to be actionable.

  4. When a specific builder is in a very bad state that cannot be corrected immediately, mark it experimental. This is done by adding the builder name to the list at go/cros-experimental. Include a comment that links to an open bug. When the bug is resolved and the builder is stable, remove the target from the experimental list and add it to the _paladin_important_boards list.
  5. While the tree is green, work on gardening tasks -- not your normal work items.
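
The hint about the Update List button in the Build Annotator can be illustrated with a small sketch. It assumes that removing the latest_build_id query parameter from the URL is enough to get an unpinned view again; the host name and the num_builds parameter below are hypothetical and used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_latest_build_id(url: str) -> str:
    """Return url with any latest_build_id query parameter removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the one that pins the build list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "latest_build_id"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical Annotator URL; only the latest_build_id name comes from the FAQ.
pinned = "https://annotator.example.com/builds?latest_build_id=123&num_builds=50"
print(strip_latest_build_id(pinned))
```

Bookmarking the stripped URL is one way to avoid refreshing a pinned view by accident.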

How do I join/leave the rotation?

Where do I ask questions and/or post answers?

Please use YAQS. If you think of something that might be helpful for the next sheriff, feel free to ask the question and answer it yourself. Please vote up the questions you find useful so that they bubble up to the top.

What should I do when I report for duty?

At the beginning of your stint as Sheriff, please perform the following tasks:

  1. Join the #crosoncall channel on comlink or irc.corp.google.com and introduce yourself as an on-duty sheriff. 
  2. Pull up the CI user interface for the CQ. Periodically refresh it to watch for failures. Also check the Release builders' statuses occasionally. You are responsible for keeping both green.
  3. Annotate build failures in the Build Annotator. Keep this user interface open during your entire shift. Refresh it periodically to look for new CQ failures.
  4. Triage any failures. You are responsible for identifying issues that need attention and finding someone to pay attention to them. Fixing issues yourself is secondary.
  5. Attend the Monday weekly handoff meeting. You will have received an invite to this weekly meeting where sheriffs from the previous week provide a handoff on current issues.

What should I do as I prepare to end my shift?

At the end of your stint as Sheriff, please perform the following tasks:
  1. Attend the following week's Monday handoff meeting.

How do I read the CI user interface?

  • What is the Commit Queue? (*-paladin)

    • Please read the Commit Queue overview.

    • The CQ Master decides success/failure for a given run. You mostly shouldn’t look at slaves except to see failure details. The Master should link to appropriate slaves.

  • What is a Canary Build?

    • An official build (with a unique version) is prepared and validated by canary builders three times a day.

    • The Canary Master decides success/failure for a given run. You mostly shouldn’t look at slaves, except to see failure details. The Master should link to appropriate slaves.

  • What is the Pre-CQ?

    • The Pre-CQ launches tryjobs to validate CLs before they enter the CQ, but doesn’t do anything other than launch them on standard tryjob servers. Note that you can find the overview of Pre-CQ builds in the UI, as well.

  • What is a PFQ builder?

    • Please read the Pre Flight Queue FAQ. Contact the Chrome Gardener (listed on waterfall) if it stays red.

  • What is the ASAN bot?

  • A toolchain buildbot is failing. What do I do?

    • These buildbots test the next version of the toolchain and are sheriffed by the toolchain team. Not your problem.

How do I find out about build failures?

  1. Watch the CI user interface, or

  2. Watch the Annotator user interface.

How do I deal with build failures?

When Sheriffs encounter build failures on the public Chromium OS buildbot, they should follow this process:

  1. See whether you can fix it by reverting a recent patch.

    • If the build or test failure has a likely culprit, contact the author. If you can’t reach the author, revert!

    • Infrastructure build failure (repo sync hang, archive build failure, build_packages, etc.)? Contact a build deputy oncall!

    • For any other infrastructure failures, contact a CI oncall!

  2. Make sure the issue is fixed.

    • If the build-breaker is taking more than 5-10 minutes to land a fix, ask them to revert.

    • If the build-breaker isn’t responding, perform the revert yourself.

  3. Watch the next build to make sure it completes cleanly.

    • Sheriffs are responsible for watching builds and making sure that people are working on making them green.

    • If there's any red on the dashboard, the Sheriffs should be watching it closely to make sure that the issues are being fixed. If there isn’t any, the Sheriffs should be working to improve the sheriffability of the tree.
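
The first step of the process above is essentially a routing decision. A minimal sketch of that decision follows; the enum and function names are hypothetical, and the categories simply mirror the three bullets under step 1.

```python
from enum import Enum, auto


class Failure(Enum):
    """Hypothetical failure categories, mirroring the bullets under step 1."""
    SUSPECT_CL = auto()   # build or test failure with a likely culprit CL
    INFRA_BUILD = auto()  # repo sync hang, archive build failure, build_packages
    INFRA_OTHER = auto()  # any other infrastructure failure


def first_action(failure: Failure, author_reachable: bool = True) -> str:
    """Return the first thing a sheriff should do for this kind of failure."""
    if failure is Failure.SUSPECT_CL:
        # Contact the author first; revert if they cannot be reached.
        return "contact the author" if author_reachable else "revert the CL"
    if failure is Failure.INFRA_BUILD:
        return "contact the build deputy oncall"
    return "contact the CI oncall"


print(first_action(Failure.SUSPECT_CL, author_reachable=False))
```

This covers only the initial routing; steps 2 and 3 (making sure the fix lands and watching the next build) still apply afterwards.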

What bugs do I file, and how do I assign them?

  • If a test fails, or a specific component fails to compile, file a bug against that component normally. Assign it to an owner for that component.

  • If you believe there's an issue with the builder infrastructure, contact the oncall as above or file a bug at go/cros-ci-bug.

  • If you believe there's an issue with the lab or hardware test infrastructure, you can file a bug following the instructions at go/chromeos-lab-bug

Ahh, the tree is green.  I can go back to my work, right?

Wrong!  When the tree is green, it’s a great time to start investigating and fixing all the niggling things that make it hard to sheriff the tree.

  • Is there some red-herring of an error message?  Grep the source tree to find out where it’s coming from and make it go away.

  • Some nice-to-have piece of info to surface in the UI?  Talk to the Infrastructure Deputy and figure out how to make that happen.

  • Some test that you wish was written and run in suite:smoke?  Go write it!

  • Has the tree been red a lot today?  Get started on your postmortem!

  • Still looking for stuff to do?  Try flipping through the Gardening Tasks.  Feel free to hack on and take any of them!

  • Run across a build-related term or acronym and didn't know what it meant? Check the glossary and add it if it's missing.

What should I do if I see a commit-queue run that I know is doomed?

If a commit queue run is known to be doomed due to a bad CL and you know which CL is bad, you can manually blame the bad CL to spare the innocent CLs from being rejected. Go to the Gerrit review page of the bad CL and set Verified to -1. The CQ will recognize the gesture and reject only the bad CL(s).


If the commit queue run has encountered infrastructure flake and is doomed, most of the time CQ will not reject any CLs (with chromite CLs being the exception).


In other cases where it is not necessary to wait for the full results of the run, you can save developer time and hassle by aborting the current CQ run.
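
The FAQ describes setting Verified to -1 from the Gerrit review page. For readers who prefer scripting, the sketch below builds (but does not send) the equivalent request for Gerrit's REST "set review" endpoint. The host, change ID, and message are hypothetical, the label name "Verified" is assumed to match the label shown in the UI, and authentication is deliberately left out because it depends on the Gerrit deployment.

```python
import json
import urllib.request


def verified_minus_one_request(host: str, change_id: str, message: str) -> urllib.request.Request:
    """Build a POST request that votes Verified -1 on the current patch set."""
    # "/a/" is Gerrit's prefix for authenticated REST calls.
    url = f"https://{host}/a/changes/{change_id}/revisions/current/review"
    body = {"message": message, "labels": {"Verified": -1}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )


# Hypothetical host and change number, for illustration only.
req = verified_minus_one_request(
    "gerrit.example.com", "12345", "Breaks the current CQ run; blaming manually."
)
print(req.get_method(), req.full_url)
```

The request is only constructed here; sending it would additionally require credentials accepted by the Gerrit server.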

How do I deal with a broken Chrome?

How can I revert a commit?

If you've found a commit that broke the build, you can revert it using these steps:

  1. Find the Gerrit link for the change in question

    • Gerrit lists recently merged changes. View the change in question.

    • In some cases (like the PFQ), you can see links to the CL that made it into the build straight from the waterfall.

  2. Click the “Revert Change” button and fill in the dialog box.

    • The dialog box includes some stock info, but you should add on to it. E.g. append a sentence such as: This broke the tree (xxx package on xxx bot).

  3. You are not done.  This has created an open change for _you_; go find it and submit it.

    • Add the author of the change as a reviewer of the revert, but push without an LGTM from the reviewer (just approve yourself and push).
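
Step 2 asks you to extend the stock text in the revert dialog. A minimal sketch of composing that message is shown below; the stock text, package name, and bot name are made up for illustration, and only the appended sentence follows the wording suggested in the FAQ.

```python
def revert_message(stock_message: str, package: str, bot: str) -> str:
    """Append the reason for the revert to the dialog's stock text."""
    reason = f"This broke the tree ({package} package on {bot} bot)."
    # Separate the reason from the stock text with a blank line.
    return f"{stock_message.rstrip()}\n\n{reason}"


# Hypothetical stock text; the real dialog text may differ.
stock = 'Revert "Example change"\n\nThis reverts commit abc123.\n'
print(revert_message(stock, "example-package", "example-paladin"))
```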

Help with specific failure categories

How do I investigate VMTest failures?

A buildbot slave appears to be down (purple). What do I do?

https://yaqs.googleplex.com/eng/q/4533245485776896

platform_ToolchainOptions autotest is failing. What do I do?

https://yaqs.googleplex.com/eng/q/6727807679594496

Who should I contact regarding ARC++ issues?

How are sheriff rotations scheduled? How do I make changes? 

What do I do when I see a NotEnoughDutsError? 

Tips

You can customize the buildbot frontend so that it shows only the builds you want to watch and refreshes at the interval you choose.


Navigate to the buildbot frontend you are watching and click on customize, in the upper left corner. From there you can select what builds to watch and how long between refreshes you want to wait.


If looking at the logs in your browser is painful, you can pass the log's URL directly to vim; vim will download the log and open it in a vim session.
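
As an alternative to reading a whole log, a short script can pull out the lines most likely to matter. This is only a sketch: the search pattern is an assumption about what failures look like in these logs, and fetch_log_lines assumes the log URL is readable without authentication.

```python
import re
import urllib.request

# Assumed markers of trouble; adjust to whatever your logs actually contain.
ERROR_PATTERN = re.compile(r"error|failed|traceback", re.IGNORECASE)


def find_suspect_lines(lines, pattern=ERROR_PATTERN):
    """Return (line_number, text) pairs for lines matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def fetch_log_lines(url):
    """Download a log and split it into lines (no authentication handled)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace").splitlines()


sample = [
    "Starting build\n",
    "ERROR: build_packages failed\n",
    "ok\n",
    "Traceback (most recent call last):\n",
]
for number, text in find_suspect_lines(sample):
    print(f"{number}: {text}")
```

With a real log, the line numbers returned by find_suspect_lines(fetch_log_lines(url)) tell you where to jump once the log is open in an editor.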

Other handy links to information: