As mentioned on the GPU testing page, the GPU bots use a new framework from Chrome's infrastructure team called recipes. The previous infrastructure sent commands from the buildbot master to the various build machines. By contrast, recipes delegate most of the responsibility for deciding how to compile the code, run tests, etc. to the machine doing the build. Compared to the legacy buildbot scripts, recipes vastly simplify modifying the bots' configuration, adding new steps, and testing changes to the bots locally, and they virtually eliminate waterfall restarts when the bots are changed.
This page describes the GPU recipe, how it's configured on the various bots, and how to modify and test it locally.
The GPU recipe is run on almost all of the bots on the chromium.gpu waterfall (all but the Android bot, as of this writing), the GPU bots on the chromium.webkit waterfall, and the GPU bots on the chromium.gpu.fyi waterfall. All of the bots on these waterfalls are split into builders and testers. The builders compile the code, upload the build results, and trigger the testers. The testers download the builds and run the tests.
Portions of the GPU recipe, which describe the tests to be run, are also run on the Chromium tryservers. The GPU trybots, which previously lived on their own waterfall, have been added to the swarming pool and are now triggered from the regular Chromium and Blink tryservers. See the GPU testing page for more details on which bot configurations run the GPU tests.
On the GPU bots, the binaries are sent from builders to testers using isolates. An isolate contains the binary as well as any dependent libraries or data files. The GPU testers do not check out the Chromium workspace; they receive all of the binaries, test harnesses and data files in the isolates coming from the builders. The practical consequence is that any new test added to the GPU bots must be made to work with isolates. The Release builders produce a static library build; the Debug builders produce a component build, to speed up linking. Isolates work with both flavors.
There are a couple of differences between the GPU tryservers and the waterfall bots. First, the tryservers have been changed to use the swarming infrastructure for better resource utilization, so they run the regular Chromium recipe rather than the GPU recipe. (Ideally, this difference will be eliminated in the future.) Second, when running the pixel tests, the tryservers expect to download a reference image from cloud storage, whereas the waterfall bots may upload one. The reason is that a bad patch may cause a tryserver to produce a bad image, so the tryservers' results cannot be trusted. Because the tryservers rely on the other waterfalls to produce their reference images, every GPU and operating system configuration on the tryservers must have at least one bot with the same configuration on a main waterfall (e.g. chromium.gpu).
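The last requirement is easy to check mechanically. In this minimal sketch, each bot configuration is assumed to be a (GPU, operating system) pair. The configuration lists are hypothetical and do not reflect the real bot inventory described on the GPU testing page.

```python
def uncovered_try_configs(waterfall_configs, try_configs):
    """Return tryserver configurations with no matching main-waterfall bot.

    Each configuration is a (gpu, os) tuple. A tryserver configuration in
    the result has no bot producing the reference images its pixel tests
    need to download.
    """
    return sorted(set(try_configs) - set(waterfall_configs))


# Hypothetical configurations, for illustration only.
waterfall = [("NVIDIA", "Linux"), ("NVIDIA", "Win7"), ("AMD", "Win7")]
tryserver = [("NVIDIA", "Linux"), ("Intel", "Mac")]
print(uncovered_try_configs(waterfall, tryserver))  # [('Intel', 'Mac')]
```

An empty result means every tryserver configuration has a waterfall bot that can produce its reference images.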
The GPU recipe lives in the tools workspace under tools/build/. Here is a .gclient for fetching the sources (only developers working at Google can fetch the internal sources; I don't know whether the recipe will run without them):
The GPU recipes themselves live in tools/build/scripts/slave/recipes/gpu/. There are three recipes: build_and_upload.py, download_and_test.py, and build_and_test.py. All of the GPU bots currently use either the build_and_upload or download_and_test recipes. build_and_test was previously used on some bots, but currently is used only for local testing.
The recipes themselves are short; the bulk of the logic is factored into modules. Here is, for example, the entire code for the build_and_upload recipe:
DEPS = [
The GPU recipe module lives in tools/build/scripts/slave/recipe_modules/gpu/ alongside the other recipe modules. api.py contains most of the logic for the GPU bots. The more significant logic includes:
Testing recipes locally is easier than testing buildbot script changes, because it's not necessary to run buildbot locally and trigger builds by hand. run_recipe.py executes the recipe in the same way it is run on the bot, and command-line arguments make it easy to change the recipe's behavior.
When running the "builder" recipes (build_and_upload, build_and_test), a separate checkout of the entire Chromium source tree is made in the tools/build workspace. This takes a fair amount of time on the first run. A git checkout is recommended for local testing; more information on how this is configured is given below.
As the GPU recipes have evolved, the number of command line arguments required in order to execute them properly has increased. Normally buildbot specifies these arguments, but "fake" values which are good enough can be used in order to test changes to the recipes.
Unfortunately, multiple dependencies prevent anyone except Google employees from running the GPU recipe effectively. If you are a non-Google Chromium contributor and wish to make contributions and run the recipe locally, please file a bug at crbug.com/new with the label Cr-Internals-GPU-Testing.
The Chromium project's distributed build system, Goma, is currently required to run the GPU recipe, which unfortunately limits its direct execution to Google employees. Visit go/ma for setup instructions. Goma must be installed into tools/build/goma/, so that tools/build/goma/goma_ctl.sh exists on disk. When running the recipe locally, other instances of the compiler proxy should be stopped.
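A quick pre-flight check of the Goma location can save you a failed run. This is a minimal sketch, and the function name is hypothetical. It checks only that tools/build/goma/goma_ctl.sh exists. It does not check whether other compiler proxy instances are still running.

```python
from pathlib import Path


def check_goma_installed(build_root):
    """Return the path to goma_ctl.sh under build_root, or raise if it is absent.

    build_root is the tools/build directory; the recipe expects Goma at
    tools/build/goma/goma_ctl.sh.
    """
    goma_ctl = Path(build_root) / "goma" / "goma_ctl.sh"
    if not goma_ctl.is_file():
        raise FileNotFoundError(
            f"{goma_ctl} not found; install Goma into {Path(build_root) / 'goma'}")
    return goma_ctl
```

For example, call check_goma_installed("/path/to/tools/build"), where the argument is a placeholder for your own tools/build directory.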
As of this writing, the GPU recipe requires credentials for two services: the isolate server and cloud storage.
Release builds via the GPU recipe automatically upload their results to the isolate server, so you must first authenticate to it. From a Chromium checkout, run:
This will open a web browser to complete the authentication flow. A @google.com email address is required in order to properly authenticate.
To test your authentication, find a hash for a recent isolate. For example, go to a recent build on Linux Release (NVIDIA), go to the setup_build step, search for the "swarm_hashes" property, and take a random hash from one of the targets like content_gl_tests. Then run the following:
If authentication succeeded, this will silently download a file called "delete_me" into the current working directory. If it failed, the script will report multiple authentication errors. In this case, use the following command to log out and then try again:
Authentication to Google Cloud Storage is needed for a couple of reasons: uploading pixel test results to the cloud, and potentially uploading and downloading builds as well, at least in Debug mode. Use the copy of gsutil in depot_tools/third_party/gsutil/gsutil, and follow the Google Cloud Storage instructions to authenticate. You must use your @google.com email address and be a member of the Chrome GPU team in order to receive read-write access to the appropriate cloud storage buckets. Roughly:
Navigate to https://storage.cloud.google.com/?arg=chromium-gpu-archive to view the contents of the cloud storage bucket.
As noted above, the GPU recipes live in tools/build/scripts/slave/recipes/gpu, and as of this writing there are three main recipes: build_and_upload.py, download_and_test.py, and build_and_test.py. All of the GPU bots on all of the waterfalls run either the build_and_upload or the download_and_test recipe. build_and_test is now mainly used for local testing, though since the introduction of isolates, download_and_test is much easier to use for that purpose.
Once all authentication is complete, the recipes can be run. Here is an example invocation of the build_and_test recipe:
This will run the recipe and put all of its output into "recipe_output.txt" in the current working directory. You can watch its progress in another terminal by running 'tail -f recipe_output.txt'. It is strongly recommended to capture the recipe's entire output when running it locally so that you can easily search back for unexpected failures.
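If you drive the recipe from a Python script rather than a shell, you can follow the capture-everything advice above with a helper like the one below. It is equivalent to redirecting with "> recipe_output.txt 2>&1". The function names are hypothetical. The failure markers are guesses, so adjust them to match what the failing steps actually print.

```python
import subprocess


def run_and_capture(cmd, log_path="recipe_output.txt"):
    """Run cmd, writing its combined stdout and stderr to log_path.

    Returns the process exit code. Progress can be followed from another
    terminal with 'tail -f' on the same file.
    """
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


def find_failures(log_path="recipe_output.txt", markers=("FAILED", "Traceback")):
    """Return (line_number, line) pairs for log lines containing any marker."""
    hits = []
    with open(log_path) as log:
        for number, line in enumerate(log, start=1):
            if any(marker in line for marker in markers):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Pass the recipe invocation as a list of arguments. The log is written even when the recipe fails, so you can search it with find_failures afterwards.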
Throughout, it is recommended to replace "myname" with your login so that, if you write results to cloud storage, it is easy to identify who wrote them.
Some notes on the command line arguments:
The build_and_test and build_and_upload recipes require the same basic command line arguments.
There are some other arguments which may be useful for local testing. skip_checkout can be used to skip the whole-workspace "gclient sync" operation, which usually triggers large rebuilds. skip_compile can be used to skip the compile and reuse the last run's binaries. See the build_and_test recipe for further details.
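When scripting repeated local runs, it can help to build the property arguments in one place. This sketch rests on an assumption you should verify against run_recipe.py itself: that run_recipe.py takes the recipe name followed by key=value property arguments, and that values can be written with str(). The recipe name "gpu/build_and_test" is also assumed from the recipes/gpu/ directory layout.

```python
def recipe_args(recipe, properties):
    """Return [recipe, "key=value", ...] for a run_recipe.py command line.

    ASSUMPTION: run_recipe.py accepts properties as key=value arguments
    after the recipe name; values are formatted with str().
    """
    return [recipe] + [f"{key}={properties[key]}" for key in sorted(properties)]


# Reuse the previous run's checkout and binaries while iterating on the
# test steps of the build_and_test recipe.
args = recipe_args("gpu/build_and_test",
                   {"skip_checkout": True, "skip_compile": True})
print(args)  # ['gpu/build_and_test', 'skip_checkout=True', 'skip_compile=True']
```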
The download_and_test recipe requires additional arguments, because ordinarily the testers are triggered by the parent builder machines.
Example invocation on Linux:
Example invocation on Windows:
Replace "myname" everywhere with your username to disambiguate any results which might be uploaded to the flakiness dashboard inadvertently.
Notes on the command line arguments:
The easiest way to see how tests are invoked on the bots is to build isolates out of your own Chromium workspace, upload them to the isolate server, and then run the download_and_test recipe, passing the isolates' hashes in the swarm_hashes property. To do this:
As described in the documentation for recipes, the primary way recipes are tested is by recording the commands they would have executed. Any change to the recipe that affects its steps in any way requires the recipe's expectations to be retrained. This includes changes to which steps are executed and to the command-line arguments passed to any command. You will discover this if you attempt to git cl upload a change to the recipe and the presubmit checks fail with output like
To retrain the recipes' expectations, run:
As of this writing, this must be done on Linux only; otherwise, extraneous differences in paths will show up.
Then carefully examine the changed files. Make sure the differences in the commands are what you expect. This is the primary line of defense against breaking the bots.
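You can review the changed expectation files with an ordinary diff. When many files change at once, it can help to reduce each file to step names and command lines first. This sketch assumes an expectation file is a JSON list of step objects with "name" and "cmd" keys. Check a real expectation file before relying on that.

```python
import difflib
import json


def step_lines(expectation_json):
    """Reduce one expectation file to "name: command line" strings, in step order.

    ASSUMPTION: the file is a JSON list of step objects with "name" and
    "cmd" keys; entries without "cmd" are shown with an empty command.
    """
    steps = json.loads(expectation_json)
    return [f"{step.get('name', '?')}: {' '.join(step.get('cmd', []))}"
            for step in steps]


def diff_steps(old_json, new_json):
    """Return unified-diff lines between two versions of an expectation file."""
    return list(difflib.unified_diff(step_lines(old_json), step_lines(new_json),
                                     "old", "new", lineterm=""))
```

In the output, added steps are prefixed with "+" and removed ones with "-". A changed command shows up as a removed line followed by an added one. An empty list means the two files describe the same steps.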
Recipes require 100% code coverage. It is not allowed to add a conditional to a recipe without tests that exercise both branches. For this reason, if you add a conditional, a new recipe module, or a new API to an existing module, it is very likely that you will need to either add a new test for it, or modify an existing one.
Study the GenTests() methods in the build_and_test, build_and_upload, and download_and_test recipes. These should give an idea of the form of the tests, and the situations where new tests are needed in order to provide 100% code coverage. See for example:
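The coverage rule can be illustrated with a toy model in plain Python. This is not the recipe engine's test API. Here gen_steps stands in for a recipe with one conditional on the skip_compile property, and gen_tests plays the role of GenTests by supplying one set of properties per branch.

```python
def gen_steps(properties):
    """Toy stand-in for a recipe: return the step names it would run."""
    steps = ["checkout"]
    if not properties.get("skip_compile"):
        steps.append("compile")
    steps.append("run_tests")
    return steps


def gen_tests():
    """Toy stand-in for GenTests: one (name, properties) pair per branch."""
    yield "default", {}
    yield "skip_compile", {"skip_compile": True}


# Each test's step list is the kind of record kept as an expectation.
expectations = {name: gen_steps(props) for name, props in gen_tests()}
for name, steps in expectations.items():
    print(name, steps)
```

Dropping the second test would leave the skip path through the conditional unexercised. That is the kind of gap the coverage requirement rejects.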
As of this writing, it is unfortunately not possible to send try jobs of the GPU recipe for actual execution on the GPU try servers. This means that significant changes to the recipe must be handled very carefully. Always file a bug about changes to the GPU recipe, and point the BUG= line of associated CLs to it. Doing so will yield a clear timeline of all commits and reverts associated with a change to the recipe. Be prepared to use drover or "git revert" to roll back changes to the recipe which introduce breakage on the bots. Always provide at least a brief description of the reason for the revert in the CL, and provide more detail in the bug report, including excerpts of logs. (The links to logs expire after only a short time.) Do not let bots stay red for an extended period of time while issues with the recipe are being fixed.
It's straightforward to add new steps to the recipe. Follow the patterns in tools/build/scripts/slave/recipe_modules/gpu/api.py for either a new build step or a new test step.
All new tests running on the tryservers and main waterfall bots (chromium.gpu, chromium.webkit) must be open-source. Please see the Chromium testing guidelines for details on this policy. If it's simply impossible to open-source a test, it may be possible to run it on the chromium.gpu.fyi waterfall, but a better approach is to create an open-source version of the test.
All new tests must be able to be run via isolates. If you are adding a new binary (unlikely), you need to add a new .isolate file in src/chrome/, and a new _run target to src/chrome/chrome_tests.gypi. Then add your isolate's name to the list in tools/build/scripts/slave/recipe_modules/gpu/common.py. If you're adding a new Telemetry based test (both likely and hopefully), it is likely that your new test or data files will already be covered by either telemetry.isolate or telemetry_gpu_test.isolate. Adjust the isolates as necessary. Create a new one if absolutely necessary.
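.isolate files are written as Python dictionary literals, so you can inspect them with the standard library. Before building, check that everything the command refers to is actually listed in the isolate, because the testers receive nothing else. The file below is a simplified, hypothetical example, and my_gpu_test is not a real target. Real files such as telemetry_gpu_test.isolate also use 'conditions' and 'includes', so confirm the key names against them.

```python
import ast

# Simplified, hypothetical .isolate contents (my_gpu_test is not real).
EXAMPLE_ISOLATE = """
{
  'variables': {
    'command': ['<(PRODUCT_DIR)/my_gpu_test', '--verbose'],
    'files': [
      '<(PRODUCT_DIR)/my_gpu_test',
      'test/data/my_gpu_test/',
    ],
  },
}
"""


def summarize_isolate(text):
    """Return (command, files) from the top-level 'variables' of an .isolate file."""
    variables = ast.literal_eval(text).get("variables", {})
    return variables.get("command", []), variables.get("files", [])


def unlisted_command_files(text):
    """Return non-flag command arguments not covered by any listed file or directory."""
    command, files = summarize_isolate(text)
    dirs = [f for f in files if f.endswith("/")]
    return [arg for arg in command
            if not arg.startswith("-")
            and arg not in files
            and not any(arg.startswith(d) for d in dirs)]
```

An empty result from unlisted_command_files means the command only refers to files the isolate ships.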
Build and run your isolate locally before attempting to add it to the GPU recipe. See the subsection above entitled "Testing your own isolates with the download_and_test recipe" for instructions on setting up the needed GYP_DEFINES to build and upload your isolate to the isolate server. To run it locally, run src/tools/swarming_client/run_isolated.py with the appropriate arguments. For simple isolates (i.e., non-Telemetry based ones):
The telemetry-based GPU tests currently use the same isolate for all the tests. In this case the invocation looks like (for example):
If you are adding a new build step, run the build_and_upload recipe locally to make sure it works.
If you are adding a new test step, it is recommended to first build its associated isolate out of your (separate) Chromium workspace and upload that to the isolate server. Then run the download_and_test recipe locally, passing the hash of your local build's isolate in the swarm_hashes dictionary. Copy the rest of the hashes from a recent build on one of the Release GPU bots running the same OS as your local machine.
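Assembling the swarm_hashes dictionary by hand is error-prone. This minimal sketch assumes isolate hashes are 40-character lowercase hex SHA-1 digests. It uses placeholder values rather than real hashes. It starts from the hashes copied from a recent Release build and overlays the ones from your local build.

```python
import re

# ASSUMPTION: isolate hashes are 40-character lowercase hex SHA-1 digests.
_HASH_RE = re.compile(r"^[0-9a-f]{40}$")


def build_swarm_hashes(recent_hashes, local_hashes):
    """Overlay locally built isolate hashes on those copied from a recent build.

    Neither input is modified; a ValueError flags a local value that does
    not look like an isolate hash (e.g. a truncated copy-and-paste).
    """
    for name, digest in local_hashes.items():
        if not _HASH_RE.match(digest):
            raise ValueError(f"{name}: {digest!r} does not look like an isolate hash")
    merged = dict(recent_hashes)
    merged.update(local_hashes)
    return merged
```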
Because currently it isn't possible to send try jobs of the recipe itself (see the section above), if you are adding a new test step, it is strongly recommended to:
When you commit your change to the recipe:
Note also that changes to the recipe might be seen for the first time on the testers rather than the builders. If you add a new binary, you might find that the testers fail during the first execution of that recipe change, unable to find the isolate for the new binary. You could work around this by committing your recipe changes in two stages: the first which adds the compilation of the new binary, and the second which adds its execution, waiting for your changes to propagate through the waterfalls in between. Or, if this is the issue, just wait for a second build and see if the problem clears up. If not, revert.
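The failure mode described above comes down to a tester's list of tests getting ahead of the swarm_hashes its builder produced. Here is a minimal sketch of that check with hypothetical test names. It is not how the GPU recipe itself reports the error.

```python
def missing_isolates(tests_to_run, swarm_hashes):
    """Return, sorted, the tests that have no isolate hash in swarm_hashes."""
    return sorted(set(tests_to_run) - set(swarm_hashes))


# Hypothetical: the tester's recipe already runs "my_new_gpu_test", but
# the build that triggered it predates the change that compiles it.
swarm_hashes = {"content_gl_tests": "a" * 40}
tests_to_run = ["content_gl_tests", "my_new_gpu_test"]
missing = missing_isolates(tests_to_run, swarm_hashes)
if missing:
    print("No isolate for:", ", ".join(missing))  # No isolate for: my_new_gpu_test
```

Committing the compile change first and the test change second means this list stays empty, because the testers only start asking for the new isolate after the builders have produced it.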
Again, there is currently no support for try jobs of the recipe itself. Be careful when making changes!