Deterministic builds

Summary

Making Chromium's build process deterministic will have a multiplicative effect on the Test isolation and Swarming projects, which aim to eliminate a large amount of the redundant testing that is currently done.

I estimate we'd save around 30% to 50% of the current testing load and greatly reduce the testing headaches we currently suffer.

This will result in much faster test cycles on both the Try Server (TS) and the Continuous Integration (CI) infrastructure, in addition to the latency savings that Test isolation and Swarming already provide.

Background

Test isolation is an on-going effort to generate an exact list of the files needed by a given unit test at runtime. It is usually the test executable itself and its test data. In particular, it excludes C++ sources and neighbouring test executables. An isolated test is addressed by the SHA-1 of all its test data files and executables but not its source files. The test isolation effort includes very efficient file archival and retrieval by using content addressed storage. It results in slightly higher utilization overall to archive and retrieve results.
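As a minimal sketch of the addressing idea, the snippet below derives one SHA-1 from the runtime files of a test. The function name isolated_digest and the way paths and content digests are combined are assumptions made for illustration; the real .isolated format is not reproduced here.

```python
import hashlib
from typing import Dict


def isolated_digest(runtime_files: Dict[str, bytes]) -> str:
    """Return one SHA-1 covering the runtime files of a test.

    runtime_files maps a relative path to the file content. Source files
    are simply never listed, so editing them cannot change the digest.
    The encoding below is hypothetical, not the real .isolated format.
    """
    h = hashlib.sha1()
    # Sort by path so the digest does not depend on dict insertion order.
    for path in sorted(runtime_files):
        content_sha1 = hashlib.sha1(runtime_files[path]).hexdigest()
        h.update(path.encode("utf-8") + b"\0" + content_sha1.encode("ascii") + b"\n")
    return h.hexdigest()
```

Two builds that produce byte-identical runtime files get the same address under this scheme, which is exactly the property a deterministic build is meant to provide.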

Swarming is an on-going effort to use Test isolation to run tests simultaneously to reduce latency in getting test results. It results in slightly higher utilization overall as a trade-off to get test results faster by using the horizontal scaling enabled by Test isolation.

Most projects do not have deterministic builds, and Chromium is no exception. A deterministic build does not happen naturally; it has to be engineered. Up to now, a few devs have toyed with creating a deterministic build and a content addressed build system, but the gains were mild and the impact negligible. This is changing. The 5 slowest tests are already run on Swarming and support Test isolation. We want to increase this to the 12 slowest tests within 6 months. Swarming knows the relationship between an isolated test and its result when run on a bot with specific features. The specific features are determined by the request. For example, the request may specify the bot's exact OS version, bitness, and metadata like the video card. TODO: link.

Google internally uses many tricks to achieve similar performance improvements at extremely high cache rates: [link]

Building the whole action graph would be wasteful (...), so we will skip executing an action unless one or more of its input files change compared to the previous build. In order to do that we keep track of the content digest of each input file whenever we execute an action. As we mentioned in the previous blog post, we keep track of the content digest of source files and we use the same content digest to track changes to files.
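The mechanism in the quote above can be sketched as follows. This is an illustration under assumptions, not Google's implementation: the class ActionCache, its method name and the in-memory inputs are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


class ActionCache:
    """Remember the input digests seen the last time each action ran."""

    def __init__(self) -> None:
        # action name -> {input path: SHA-1 of its content}
        self._seen: Dict[str, Dict[str, str]] = {}

    def execute_if_changed(
        self, action: str, inputs: Dict[str, bytes], run: Callable[[], None]
    ) -> bool:
        """Run the action only if an input digest changed; return True if it ran."""
        digests = {
            path: hashlib.sha1(data).hexdigest() for path, data in inputs.items()
        }
        if self._seen.get(action) == digests:
            return False  # Same inputs as the previous build: skip.
        run()
        # Record the digests only after the action completed.
        self._seen[action] = digests
        return True
```

Note that a whitespace-only edit still changes the source digest, so the compile action reruns. The saving comes one step later, when a deterministic compiler emits the same object file and the downstream actions see unchanged inputs.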

Project Goal: Dramatically decreased test latency

If you run a task on Swarming where the key is the SHA-1 of the runtime files needed to run the test, and the test was already run, Swarming returns the cached results immediately. In addition, binaries do not need to be archived if they are already in the isolate server cache. So there is effectively a 2x multiplier for each target that becomes deterministic and runs on Swarming. Benefits are both monetary (less hardware is required) and in developer time (reduced latency from having less work to do on the TS and CI).
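The two savings described above, skipping the upload and skipping the run, can be illustrated with a toy in-memory model. IsolateStore and ResultCache are hypothetical names; this does not reflect the actual isolate server or Swarming APIs.

```python
import hashlib
from typing import Callable, Dict


class IsolateStore:
    """Toy content-addressed store: identical content is uploaded once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.uploads = 0

    def archive(self, content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        if digest not in self.blobs:  # Already cached: nothing to upload.
            self.blobs[digest] = content
            self.uploads += 1
        return digest


class ResultCache:
    """Toy result cache: a test keyed by the same SHA-1 runs only once."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.executions = 0

    def run(self, isolated_sha1: str, execute: Callable[[], str]) -> str:
        if isolated_sha1 in self._results:
            return self._results[isolated_sha1]  # Cached result, no bot needed.
        self.executions += 1
        result = execute()
        self._results[isolated_sha1] = result
        return result
```

Both caches only hit when the binary is byte-identical across builds, which is why each deterministic target benefits twice.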

Example workflow on Windows

echo "" >> foo_main.cc - Make a whitespace change to foo_main.cc to force a compilation.
compile foo_main.cc -> foo_main.obj - This recreates the same foo_main.obj as before since the code didn't change.
link foo_main.obj -> bar_test.exe - This step could be saved by a content-addressed build system, see "Extension of the project" below.
isolate bar_test.exe -> isolateserver - The binary didn't change, so it is not uploaded again.
swarming.py run <sha1 of bar_test.isolated> - Swarming immediately returns the results of the last green build.

Incremental progress

  1. Getting Swarming task names to be content addressed. This is a technical change so that the SHA-1 of the .isolated file plus the OS generates the key used to trigger and get results for a unit test. This relationship is cached and returned when a request is redundant. Relatively simple, but there has been no driving force to complete this change yet.
  2. Getting .isolated to be deterministic. Done.
  3. Getting test binary to be deterministic. This is the big task at hand here.
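Step 1 can be sketched as a key derived from the .isolated SHA-1 and the requested bot features. The function task_key, the dimension names and the JSON encoding are assumptions for illustration only, not the format Swarming uses.

```python
import hashlib
import json
from typing import Dict


def task_key(isolated_sha1: str, dimensions: Dict[str, str]) -> str:
    """Derive a content-addressed task key (hypothetical encoding).

    dimensions holds the requested bot features, e.g. {"os": "Windows-7"}.
    """
    payload = json.dumps(
        {"isolated": isolated_sha1, "dimensions": dimensions},
        sort_keys=True,  # Canonical ordering so equal requests map to equal keys.
        separators=(",", ":"),
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

With such a key, the same isolated test requested for the same OS maps to the same cached result, while a request for another OS gets its own entry.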

Non goals

Making the build deterministic is not a goal under these conditions:
  • when the absolute path of the base directory is different, i.e. the build will only be deterministic when built from the exact same base path. This is simple to do on build slaves, harder for users.
    • This is a problem since build slaves embed the builder name in the path by default, but this can be fixed.
  • for developers. The primary use cases are the CI (continuous integration tests) and the TS (Try Server, pre-commit tests).
  • Content addressed build (CAB). A CAB can be built on top of a deterministic build, but this is not what this effort is about.

Short term non-goals

These are things that will explicitly not be looked at initially:
  • Make the build deterministic on Visual Studio. The fact the toolset is not open source affects our capability to make it work.
  • Make the build deterministic on official builds.
  • Support for debug symbols.

Challenges

Each toolset is non-deterministic in different ways, so the work has to be redone on each platform. Some items in the build process are non-deterministic by design. On the other hand, preliminary work has been done, notably by chrisha@ on Windows, to create a tool that removes non-determinism from the Visual Studio toolset, so we are not starting from scratch and there is expertise in the team to achieve this goal. Automatically catching build determinism regressions will require follow-up work and should be included in the head count and project duration planning.
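One simple way to catch determinism regressions automatically is to build the same revision twice and compare the digests of the outputs. The sketch below assumes the outputs of each build are already loaded in memory; the function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def digest_outputs(outputs: Dict[str, bytes]) -> Dict[str, str]:
    """Map each output path to the SHA-1 of its content."""
    return {path: hashlib.sha1(data).hexdigest() for path, data in outputs.items()}


def nondeterministic_outputs(
    first: Dict[str, bytes], second: Dict[str, bytes]
) -> List[str]:
    """Return the sorted paths whose content differs between two builds.

    A path present in only one build is also reported.
    """
    a = digest_outputs(first)
    b = digest_outputs(second)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty list means the two builds were byte-identical; any path returned is a candidate for investigation.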

Non-determinism comes from various sources. Here are the principal categories:

Non-determinism originating from the code itself

  • File paths: direct or indirect embedding of non-deterministic source file paths in the final binary; for example use of C/C++ macro __FILE__ with the use of absolute file paths instead of relative file paths.
  • File content references; for example use of C/C++ macros __LINE__ and __COUNTER__.
  • Timestamps: for example, use of C/C++ macros __DATE__, __TIME__, __TIMESTAMP__, embedding the compilation time in the binary, etc.
  • Source control revision numbers embedded in the binary. The fact that the SCM reference changed doesn't mean the content changed, and as such it shouldn't affect the final binary, except for extraneous metadata.
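Code-induced sources like the macros listed above can be located with a simple text scan. This is a rough sketch: the function name is hypothetical, and the scan does not skip comments or string literals, so it can report false positives.

```python
import re
from typing import List, Tuple

# Predefined C/C++ macros whose expansion depends on time, path or position.
_MACRO_RE = re.compile(r"\b__(?:DATE|TIME|TIMESTAMP|FILE|LINE|COUNTER)__\b")


def find_nondeterministic_macros(source: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, macro name) for each occurrence found."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _MACRO_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Each hit still needs a manual look, since a use of __FILE__ is only a problem when the compiler is invoked with an absolute path.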

Toolset induced non-determinism

  • Build system non-determinism: this bucket covers non-determinism induced by GYP, ninja or GN, and not by the underlying toolset.
  • Object file ordering in the final executable and order of dynamic library listing.
  • Multithreading in the toolset; multi-threaded linking. Possible issue, not confirmed.
  • Caused by the toolset on purpose: GUIDs embedded in debug symbols (PDB), timestamps in linked executable. See zap_timestamp as a tool to work around Visual Studio limitations.
  • Library induced; possible issue, not confirmed.
  • Between toolsets: the build determinism can only be achieved by using the exact same toolset; exact same compiler, linker and ancillary tools. Fighting this is a waste of time, simply stating it here to clarify.

Windows specific

  • The manifest XML generated in the PE is not deterministic, its textual format occasionally changes even if the logical content is the same.
  • Path casing.
  • Object files are linked by the linker in what seems like a hash table, so changing something as simple as path casing can completely shuffle the order.
  • In LTCG/WPO, the linker is dual threaded, so we think it'll be less deterministic.
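As an illustration of removing one linker-embedded timestamp, the sketch below zeroes the TimeDateStamp field of the COFF file header in a PE image. It is not zap_timestamp: real images carry other varying fields that this sketch does not touch, and the function name is hypothetical.

```python
import struct


def zero_pe_timestamp(image: bytes) -> bytes:
    """Return a copy of a PE image with the COFF TimeDateStamp set to 0."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    out = bytearray(image)
    # COFF header follows the signature: Machine (2 bytes),
    # NumberOfSections (2 bytes), then TimeDateStamp (4 bytes).
    struct.pack_into("<I", out, pe_offset + 8, 0)
    return bytes(out)
```

Two images that differ only in this field become identical after the rewrite, so they hash to the same content address.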

OSX specific

  • TODO(maruel): Investigate OSX specific issues which will likely come up. clang supports zapping out __FILE__ with -D__FILE__=foo.

Android specific

  • TODO(maruel): Investigate. Java is definitely going to add more non-determinism. On the other hand, it's high value.

Extension of the project

Getting the build system to be content addressed, as described in the Google reference above, is a secondary task. It is out of scope for now but is a natural extension of the project. It saves on the build process itself at the cost of calculating hashes for each intermediate file. It builds on the fact that the build is itself deterministic, so it is not worth doing before the build is deterministic and this property is enforced. It will save significantly on build time on the build machines.