GPU Pixel Wrangling
GPU Pixel Wrangling is the process of keeping various GPU bots green. On the GPU bots, tests run on physical hardware with real GPUs, not in VMs like the majority of the bots on the Chromium waterfall.
Waterfalls
The waterfalls work much like any other; see the Tour of the Chromium Buildbot Waterfall for a more detailed explanation of how this is laid out. We have more subtle configurations because the GPU matters, not just the OS and Release vs. Debug. Hence we have Windows Nvidia Release bots, Mac Intel Debug bots, and so on.
The waterfalls we’re interested in are:
- Chromium GPU [http://build.chromium.org/p/chromium.gpu/waterfall?reload=120] [console view]
- Various operating systems, configurations, GPUs, etc.
- WebKit GPU [http://build.chromium.org/p/chromium.webkit/waterfall?reload=120] [console view]:
- Uses the tip-of-tree WebKit instead of the version rolled into Chromium, better for finding regressions in WebKit
- This waterfall has many non-GPU bots. Views of only the GPU bots: [waterfall view] [console view]
- The GPU tryservers
- These bots run try jobs from "git cl try" or the Rietveld UI.
- The GPU pixel wrangler needs to check that these bots are online, are running jobs correctly, and aren't overloaded.
- Of course, try jobs may occasionally fail due to bad patches. This is normal.
- The try servers are in the process of being transitioned to swarming, so the process of monitoring them is changing.
- See the section below on making sure the try servers are in good health.
- [Builder view of the tryserver.chromium.gpu waterfall]
- To a lesser degree: the Chromium GPU FYI waterfall [http://build.chromium.org/p/chromium.gpu.fyi/waterfall?reload=120] [console view]
- These bots run less-standard configurations like Windows XP, Linux with Intel GPUs, etc.
- There are some longstanding failures on this waterfall. The Windows XP, Linux Intel, and Linux AMD bots have been red for a long time. Don't worry about these failures.
- These bots build with top of tree ANGLE rather than the DEPS version. This means that the tree can go red, with no Chromium commit to blame, if commits to the ANGLE repository break compilation with Chromium. It is advisable to keep this waterfall green, and not to let it go red for long periods of time (12-24 hours).
- Ignore the Win7 Audio and Linux Audio bots. They are not maintained by the Chrome GPU team and are being decommissioned.
As an alternative to constantly watching the various waterfall pages listed above, check out the check_gpu_bots.py script which can be configured to repeat periodically and email you if there's something that needs to be looked into.
The bots run several test suites. The majority of them have been migrated to the Telemetry harness and run within the full browser, in order to better test the code that is actually shipped. As of this writing, the tests include:
- Tests using the Telemetry harness:
- The WebGL conformance tests: src/content/test/gpu/gpu_tests/webgl_conformance.py
- A Google Maps test: src/content/test/gpu/gpu_tests/maps.py
- Context loss tests: src/content/test/gpu/gpu_tests/context_lost.py
- GPU process launch tests: src/content/test/gpu/gpu_tests/gpu_process.py
- Hardware acceleration validation tests: src/content/test/gpu/gpu_tests/hardware_accelerated_feature.py
- GPU memory consumption tests: src/content/test/gpu/gpu_tests/memory.py
- Pixel tests validating the end-to-end rendering pipeline: src/content/test/gpu/gpu_tests/pixel.py
- content_gl_tests: see src/content/content_tests.gypi
- gles2_conform_test (requires internal sources): see src/gpu/gles2_conform_support/gles2_conform_test.gyp
- gl_tests: see src/gpu/gpu.gyp
- angle_unittests: see src/gpu/gpu.gyp
Additionally, the Release bots run:
- tab_capture_performance_tests: see performance_browser_tests in src/chrome/chrome_tests.gypi and src/browser/extensions/api/tab_capture/tab_capture_performancetest.cc
More details about the bots' setup can be found on the GPU Testing page.
- Ideally a wrangler should be a committer on both the Blink and Chromium projects. If you're on the GPU pixel wrangling rotation, you will receive an email notifying you of the upcoming shift, along with a calendar appointment.
- Apply for access to the bots.
- If you aren't a committer, don't panic. It's still best for everyone on the team to become acquainted with the procedures of maintaining the GPU bots.
- In this case you'll upload CLs to Rietveld to perform reverts (optionally using the new "Revert" button in the UI), and might consider using TBR= to speed through trivial and urgent CLs. In general, try to send all CLs through the commit queue.
- Contact bajones, kbr, vangelis, zmo, or another member of the Chrome GPU team who's already a committer for help landing patches or reverts during your shift.
How to Keep the Bots Green
- Watch for redness on the tree.
- The bots are expected to be green all the time. Flakiness on these bots is neither expected nor acceptable.
- If a bot goes consistently red, it's necessary to figure out whether a recent CL caused it, or whether it's a problem with the bot or infrastructure.
- If it looks like a problem with the bot (deep problems like failing to check out the sources, the isolate server failing, etc.) notify the Chromium troopers. See the general tree sheriffing page for more details.
- Otherwise, examine the builds just before and after the redness was introduced, and note the revisions in each. Depending on whether you're looking at the Chromium or Blink trees, use either the Chromium or Blink revisions. Unfortunately, you'll need to construct your regression URL manually:
- For regressions on the Chromium tree: use this URL and replace "[rev1]" and "[rev2]" in the "range=[rev1]:[rev2]" URL query parameter
- For regressions on the Blink tree: use this URL and replace "[rev1]" and "[rev2]" in the "range=[rev1]:[rev2]" URL query parameter
- File a bug capturing the regression range and excerpts of any associated logs. Regressions should be marked P1. CC engineers who you think may be able to help triage the issue. Keep in mind that the logs on the bots expire after a few days, so make sure to add copies of relevant logs to the bug report.
- Study the regression range carefully. Changes outside the Chromium tree (i.e., in /trunk/tools/ rather than /trunk/src/) may break the GPU bots, because the GPU recipe which drives these bots is in the tools repository.
- Use drover to revert any CLs which break the GPU bots. In the revert message, provide a clear description of what broke, links to failing builds, and excerpts of the failure logs, because the build logs expire after a few days.
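The "range=[rev1]:[rev2]" substitution described above can also be done with a few lines of Python. This is only a minimal sketch: CHANGELOG_BASE is a placeholder (the real Chromium and Blink changelog URLs are the ones linked from this page), regression_url is a hypothetical helper name, and it assumes rev1 is the last good revision and rev2 the first bad one.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real Chromium or Blink changelog URL.
CHANGELOG_BASE = "https://example.invalid/changelog"


def regression_url(last_good_rev, first_bad_rev, base=CHANGELOG_BASE):
    """Return a URL whose range parameter spans the suspect revisions."""
    if last_good_rev > first_bad_rev:
        raise ValueError("last good revision must not exceed first bad revision")
    # safe=":" keeps the colon readable instead of encoding it as %3A.
    query = urlencode({"range": f"{last_good_rev}:{first_bad_rev}"}, safe=":")
    return f"{base}?{query}"


print(regression_url(242500, 242523))
```

Paste the resulting URL into the bug report so that others can see the same regression range.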
- Make sure the bots are running jobs.
- Keep an eye on the console views of the various bots.
- Make sure the bots are all actively processing jobs. If they go offline for a long period of time, the "summary bubble" at the top may still be green, but the column in the console view will be gray.
- Email the Chromium troopers if you find a bot that's not processing jobs.
- Make sure the GPU try servers are in good health.
- The GPU try servers have been transitioned to the swarming infrastructure used by the other Chromium trybots. They are no longer distinct bots on a separate waterfall, but instead run as part of the regular tryjobs on the Chromium and Blink waterfalls. The GPU tests run as part of the following tryservers' jobs:
- linux_chromium_rel_ng on the tryserver.chromium.linux waterfall
- mac_chromium_rel_ng on the tryserver.chromium.mac waterfall
- win_chromium_rel_ng on the tryserver.chromium.win waterfall
- linux_blink_rel, mac_blink_rel and win_blink_rel on the tryserver.blink waterfall
- The best tool to use to quickly find flakiness on the tryservers is the new Chromium Try Flakes tool. Look for the names of GPU tests (like maps_pixel_test) as well as the test machines (e.g. mac_chromium_rel_ng and mac_blink_rel). If you see a flaky test, file a bug like this one. Also look for compile flakes that may indicate that a bot needs to be clobbered. Contact the Chromium sheriffs or troopers if so.
- The Swarming Server Stats tool provides an overview of the health of these bots. Use the "gpu" drop-down to go through the supported GPU types and select the resulting dimension corresponding to one of the bots. (The Windows and Linux bots use the same GPU; it's best to examine them independently.) Check the activity on these bots to ensure the number of pending jobs seems reasonable according to historical levels. Sign in with an @google.com account in order to examine individual bots' history and see the successes, failures and durations of tests on the bot.
- For more in-depth detail, examine the specific bots above. See if there are any pervasive build or test failures. Note that test failures are expected on these bots: individuals' patches may fail to apply, fail to compile, or break various tests. Look specifically for patterns in the failures. It isn't necessary to spend a lot of time investigating each individual failure. (Use the "Show: 200" link at the bottom of the page to see more history.)
- If the same set of tests is failing repeatedly, look at the individual runs. Examine the swarming results and see whether they're all running on the same machine. If they are, something might be wrong with the hardware. Use the Swarming Server Stats tool to drill down into the specific builder.
- If you see the same test failing in a flaky manner across multiple machines and multiple CLs, it's crucial to investigate why it's happening. crbug.com/395914 was a recent example of an innocent-looking Blink change which made it through the commit queue and introduced widespread flakiness in a range of GPU tests. The failures were also most visible on the try servers as opposed to the main waterfalls.
- Use Chrome Monitor to see if any of the tryservers seem to be falling far behind (hundreds of jobs queued up). If so, email the Chromium troopers for help. Try to correlate the data with the swarming server stats to see whether the GPU tryservers have fallen behind.
- chrome-monitor for linux_chromium_rel_ng
- chrome-monitor for mac_chromium_rel_ng
- chrome-monitor for win_chromium_rel_ng
- chrome-monitor for linux_blink_rel
- chrome-monitor for mac_blink_rel
- chrome-monitor for win_blink_rel
- Check if any pixel test failures are actual failures or need to be rebaselined.
- For a given build failing the pixel tests, click the "stdio" link of the "pixel" step.
- The output will contain a link of the form http://chromium-browser-gpu-tests.commondatastorage.googleapis.com/view_test_results.html?242523_Linux_Release_Intel__telemetry
- Visit the link to see whether the generated or reference images look incorrect.
- All of the reference images for all of the bots are stored in cloud storage under the link https://cloud.google.com/console#/storage/chromium-gpu-archive/reference-images/ . They are indexed by version number, OS, GPU vendor, GPU device, and whether or not antialiasing is enabled in that configuration. You can download the reference images individually to examine them in detail.
- Rebaseline pixel test reference images if necessary.
- Increment the revision number of the particular test in src/content/test/gpu/page_sets/pixel_tests.json .
- When this is committed, all of the bots will generate new reference images for the new version of the test.
- Alternatively, if absolutely necessary, you can use the Chrome Internal GPU Pixel Wrangling Instructions to delete just the broken reference images for a particular configuration.
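The revision bump is normally a one-line manual edit, but the idea can be sketched in Python. The layout below (a "pages" list whose entries carry "name" and "revision" fields) is an assumption made for illustration, as are the test names; check the real pixel_tests.json before relying on it. bump_revision is a hypothetical helper.

```python
import json

# Assumed, simplified layout; the real pixel_tests.json may differ.
SAMPLE = """
{
  "pages": [
    {"name": "Pixel.ExampleTestA", "revision": 2},
    {"name": "Pixel.ExampleTestB", "revision": 5}
  ]
}
"""


def bump_revision(doc, test_name):
    """Increment the revision of one test and return the new value."""
    for page in doc["pages"]:
        if page["name"] == test_name:
            page["revision"] += 1
            return page["revision"]
    raise KeyError(test_name)


doc = json.loads(SAMPLE)
new_rev = bump_revision(doc, "Pixel.ExampleTestA")
print(new_rev)
print(json.dumps(doc, indent=2))
```

Only the named test's revision changes, so the other tests keep their existing reference images.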
- Update Telemetry-based test expectations if necessary.
- Most of the GPU tests are run inside a full Chromium browser, launched by Telemetry, rather than a Gtest harness. The tests and their expectations are contained in src/content/test/gpu/gpu_tests/ . See for example webgl_conformance_expectations.py , gpu_process_expectations.py and pixel_expectations.py.
- See the header of the file for a list of modifiers to specify a bot configuration. It is possible to specify OS (down to a specific version, say, Windows 7 or Mountain Lion), GPU vendor (NVIDIA/AMD/Intel), and a specific GPU device.
- The key is to maintain the highest coverage: if you have to disable a test, disable it only on the specific configurations on which it's failing. Note that it is not possible to distinguish between Debug and Release configurations.
- Mark tests failing or skipped, which will suppress flaky failures, only as a last resort. It is only really necessary to suppress failures that are showing up on the GPU tryservers, since failing tests no longer close the Chromium tree.
- Please read the section on stamping out flakiness for motivation on how important it is to eliminate flakiness rather than hiding it.
- For the remaining Gtest-style tests, use the modifiers DISABLE_ / FLAKY_ / etc. to suppress or disable any tests if necessary.
- (Rarely) Update the version of the WebGL conformance tests. See below.
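The advice above about suppressing a test only on the configurations where it fails can be illustrated with a toy model. This is not the Telemetry expectations API; Expectation, is_suppressed, the tag names, and the test path are all hypothetical, chosen only to show how narrow conditions preserve coverage elsewhere.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Expectation:
    pattern: str           # test name or glob
    conditions: frozenset  # tags that must ALL be present on the bot
    bug: int               # tracking bug number (placeholder values here)


def is_suppressed(test, bot_tags, expectations):
    """True if any expectation matches both the test and the bot."""
    return any(
        fnmatchcase(test, e.pattern) and e.conditions <= bot_tags
        for e in expectations
    )


# Suppressed only on Windows 7 with NVIDIA GPUs. Note there is no tag for
# Debug vs. Release, mirroring the limitation described above.
EXPECTATIONS = [
    Expectation("conformance/textures/*", frozenset({"win7", "nvidia"}), bug=0),
]

test = "conformance/textures/example.html"
print(is_suppressed(test, {"win7", "nvidia"}, EXPECTATIONS))
print(is_suppressed(test, {"linux", "nvidia"}, EXPECTATIONS))
```

The test is suppressed on the first bot but still runs, and can still catch regressions, on the second.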
When Bots Misbehave (SSHing into a bot)
Updating the WebGL Conformance Tests
Occasionally a bug in the WebGL conformance tests will be exposed by a WebKit roll, and the best solution is to roll forward to a new version of the WebGL conformance suite in which the bug has been fixed. In order to do this, follow the steps below.
- Visit https://chromium.googlesource.com/external/khronosgroup/webgl.git with your browser.
- Find the full git hash of the revision you want to roll to.
- Modify the entry for src/third_party/webgl/src in src/DEPS with the new hash.
- Send the CL to the GPU try servers (win_gpu, linux_gpu, mac_gpu).
- If the CL looks good, commit it.
- Watch the GPU bots on the various waterfalls. There are more OS and GPU combinations than the GPU try servers can reasonably cover, so an update to the WebGL conformance suite is likely to fail on one or more bots.
- Update the WebGL conformance suite expectations to suppress failures if necessary. File bugs about the need for these suppressions so they can be removed in the future.
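The DEPS edit in the steps above can be sketched as a small Python helper that checks for a full 40-character hash before substituting it. The shape of the DEPS entry in SAMPLE_DEPS is an assumption (the real file may format the entry differently), and roll_webgl, OLD, and NEW are hypothetical names with placeholder hashes.

```python
import re

FULL_HASH = re.compile(r"[0-9a-f]{40}")
ENTRY = re.compile(r"(khronosgroup/webgl\.git@)[0-9a-f]{40}")

OLD = "a" * 40  # placeholder hashes, not real revisions
NEW = "b" * 40

# Assumed shape of the DEPS entry; the real file may differ.
SAMPLE_DEPS = (
    "  'src/third_party/webgl/src':\n"
    "    'https://chromium.googlesource.com/external/khronosgroup/webgl.git@"
    + OLD + "',\n"
)


def roll_webgl(deps_text, new_hash):
    """Return deps_text with the WebGL revision replaced by new_hash."""
    if not FULL_HASH.fullmatch(new_hash):
        raise ValueError("expected a full 40-character git hash")
    updated, count = ENTRY.subn(lambda m: m.group(1) + new_hash, deps_text)
    if count != 1:
        raise ValueError(f"expected exactly one WebGL entry, found {count}")
    return updated


print(roll_webgl(SAMPLE_DEPS, NEW))
```

Rejecting abbreviated hashes up front matches the instruction to use the full git hash of the target revision.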
Extending the GPU Pixel Wrangling Rotation