Make Chromium's build process deterministic. Tracking issue: crbug.com/314403
Improve cycle time and reduce infrastructure utilization by eliminating redundant tasks.
- Reduce I/O load on the Isolate Server by having deterministic binaries.
- Binaries do not need to be archived if they are already in the Isolate Server cache. If the build is deterministic, this happens frequently.
- Take advantage of Swarming's native task deduplication by skipping runs of redundant test executables.
- If an .isolated task was already run on Swarming with the exact same SHA-1 and it succeeded, Swarming returns the cached results immediately.
So each target that becomes deterministic and runs on Swarming gets a 2x multiplier. The benefits are both monetary (less hardware is required) and developer time (reduced latency, since the TS and CI have less work to do).
We estimate we'd save >20% of the current testing load, resulting in faster test cycles on the Try Server (TS) and the Continuous Integration (CI) infrastructure. Swarming already dedupes 1~7% of task runtime simply due to incremental builds.
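The Isolate Server saving described above can be sketched as a content-addressed cache: a binary is uploaded only when its digest is not already present. This is a minimal illustrative model, not the real isolateserver API; the `archive` function and in-memory cache are hypothetical.

```python
import hashlib

# Hypothetical in-memory stand-in for the Isolate Server's
# content-addressed cache; the real server is networked but is
# keyed the same way, by the SHA-1 of the content.
cache = set()

def archive(binary_content):
    """Upload a binary only if its digest is not already cached.

    Returns (sha1, uploaded). With a deterministic build, an
    unchanged binary rebuilds to the exact same bytes, so the
    second call is a cache hit and no upload happens.
    """
    sha1 = hashlib.sha1(binary_content).hexdigest()
    if sha1 in cache:
        return sha1, False   # deterministic rebuild: upload skipped
    cache.add(sha1)
    return sha1, True
```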
Test isolation is an ongoing effort to generate an exact list of the files needed by a given unit test at runtime. It enables three benefits:
- Horizontal scaling by splitting a slow test into multiple shards run concurrently.
- Horizontal scaling by running multiple tests concurrently on multiple bots.
- Multi-OS testing of the same binaries. For example, build once, test on XP, Vista, Win7, Win8, Win10.
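The runtime file list is described by a .isolate file next to the test. The example below is simplified and illustrative (the target and paths are hypothetical; real files use GYP-style variable expansion such as <(PRODUCT_DIR) and <(EXECUTABLE_SUFFIX)):

```
# foo_unittests.isolate (illustrative)
{
  'variables': {
    'command': [
      './foo_unittests',
    ],
    'files': [
      './foo_unittests',
      'test_data/',
    ],
  },
}
```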
Swarming is the task distributor that leverages test isolation to run tests simultaneously, reducing the latency of getting test results.
Most projects do not have deterministic builds, and Chromium is no exception. A deterministic build is not something that happens naturally; it has to be engineered. Swarming knows the relationship between an isolated test and its result when run on a bot with specific features. The specific features are determined by the request; for example, the request may specify the bot's OS version, bitness, metadata like the video card, etc.
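The relationship Swarming tracks can be modeled as a mapping from (isolated SHA-1, bot dimensions) to a cached result. The sketch below is a toy model, not Swarming's real schema; the function names and dimension keys are illustrative.

```python
# Toy model: Swarming's result cache is keyed on both the .isolated
# content hash and the requested bot dimensions, so a test only
# deduplicates against prior runs on equivalent bots.
results = {}

def cache_key(isolated_sha1, dimensions):
    # Dimensions are an unordered set of key/value requirements,
    # e.g. OS version, bitness, video card metadata.
    return (isolated_sha1, frozenset(dimensions.items()))

def lookup(isolated_sha1, dimensions):
    """Return a cached result for this exact task on this bot class."""
    return results.get(cache_key(isolated_sha1, dimensions))

def record(isolated_sha1, dimensions, result):
    results[cache_key(isolated_sha1, dimensions)] = result
```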
Google internally uses many tricks to achieve similar performance improvements at extremely high cache hit rates: [link
Building the whole action graph would be wasteful (...), so we will skip executing an action unless one or more of its input files change compared to the previous build. In order to do that we keep track of the content digest of each input file whenever we execute an action. As we mentioned in the previous blog post, we keep track of the content digest of source files and we use the same content digest to track changes to files.
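The incremental strategy the quote describes can be sketched as follows. This is a toy model under stated assumptions (in-memory digest store, inputs passed as a path-to-content dict), not Google's actual build system:

```python
import hashlib

# Digest of each action's inputs the last time it was executed.
last_digests = {}

def inputs_digest(input_files):
    """Combine the content digests of all input files into one digest.

    Paths are sorted first so the digest is independent of
    enumeration order (see the note on hash-table iteration below).
    """
    h = hashlib.sha256()
    for path, content in sorted(input_files.items()):
        h.update(path.encode())
        h.update(hashlib.sha256(content).digest())
    return h.hexdigest()

def maybe_run(action_name, input_files, run):
    """Skip the action unless one or more of its input files changed."""
    digest = inputs_digest(input_files)
    if last_digests.get(action_name) == digest:
        return 'skipped'
    run()
    last_digests[action_name] = digest
    return 'ran'
```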
Making the build deterministic is not a goal under the following conditions:
- When the absolute path of the base directory differs; i.e. the build is only expected to be deterministic when built from the exact same base path. This is simple to guarantee on build slaves, harder for users.
- This is a problem since build slaves embed the builder name in the path by default; this can be fixed.
- For developers. The primary use cases are the CI (continuous integration tests) and the TS (Try Server, pre-commit tests).
- A content-addressed build. A CAB can be built on top of a deterministic build, but that is not what this effort is about.
- Official builds, at least not in the short term. Reproducible builds do have security benefits, so they can be considered an eventual extension of the project.
Testing currently follows this pattern:
ninja -C out/Release all
mv out out.1
ninja -C out/Release all
compare_build_artifacts.py -f out -s out.1
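The comparison step diffs the two build trees file by file. A minimal sketch of that comparison is below; it is a hypothetical simplification, not the real compare_build_artifacts.py:

```python
import hashlib
import os

def tree_digests(root):
    """Map each file's path (relative to root) to its content SHA-1."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                digests[os.path.relpath(path, root)] = \
                    hashlib.sha1(f.read()).hexdigest()
    return digests

def compare(first, second):
    """Return the relative paths whose contents differ between builds.

    A missing file counts as a difference; an empty result means the
    two builds were bit-for-bit identical, i.e. deterministic.
    """
    a, b = tree_digests(first), tree_digests(second)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```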
Regressions are tracked via a whitelist JSON file. Conformance is not mandatory yet, but it eventually will be.
Non-determinism comes from various sources. Here are the principal categories:
Non-determinism originating from the code base itself
- File paths: direct or indirect embedding of non-deterministic source file paths in the final binary; for example, use of the C/C++ macro __FILE__ with absolute instead of relative file paths.
- File content references; for example, use of the C/C++ macros __LINE__ and __COUNTER__.
- Timestamps: for example, use of C/C++ macros __DATE__, __TIME__, __TIMESTAMP__, embedding the compilation time in the binary, etc.
- Source control metadata: the checkout revision number embedded in the binary. The fact that the SCM reference changed doesn't mean the content changed, and as such it shouldn't affect the final binary, beyond extraneous metadata.
- Build tools: build scripts enumerating a hash table (python dict) without sorting to operate on build outputs.
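The last point is easy to hit in Python build scripts: the iteration order of a set (or, historically, a dict) must never leak into build outputs. A minimal illustration of the problem and the fix, with made-up object file names:

```python
# Build outputs collected in a set: iteration order depends on the
# interpreter's hash randomization and can differ between runs.
objects = {'base.o', 'net.o', 'ui.o'}

# Non-deterministic: the generated link line may change run to run,
# which shuffles the final binary even though nothing changed.
nondeterministic_cmd = 'link ' + ' '.join(objects)

# Deterministic: always sort before emitting anything into the build.
deterministic_cmd = 'link ' + ' '.join(sorted(objects))
```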
Toolset and infrastructure induced non-determinism
- Build system non-determinism: this bucket covers non-determinism induced by GYP, Ninja, or GN, as opposed to the underlying toolset or the source code itself.
- Object files ordering in the final executable and order of dynamic library listing. This can occur in any step or tool working on object files.
- Multithreading in the toolset; for example, multi-threaded linking.
- Non-determinism caused by the toolset on purpose: GUIDs embedded in debug symbols (PDB), timestamps in the linked executable. See zap_timestamp as a tool to work around Visual Studio limitations.
- Between toolsets: build determinism can only be achieved by using the exact same toolset; the exact same compiler, linker, and ancillary tools. Fighting this is a waste of time; it is stated here only for clarity.
- Infrastructure: local versus remote compilation may (and usually does) generate different object files.
OS Specific Challenges
Each toolset is non-deterministic in different ways, so the work has to be redone on each platform.
Windows
- Issues seen:
- The manifest XML generated in the PE is not deterministic; its textual format occasionally changes even when the logical content is the same.
- Path casing.
- Object files are linked in what seems like hash-table order, so changing something as simple as path casing can completely shuffle the link order.
- With LTCG/WPO, the linker is dual-threaded, so we expect it to be less deterministic.
- zap_timestamp took care of many issues. The main blockers are x64 and PDBs.
macOS
- The toolset is heavily non-deterministic.
- The first step is ZERO_AR_DATE=1, which was reverted due to iOS breakage.
- Even then it's not sufficient; someone needs to take a close look at the Mach-O format.
Example workflow on Windows
# Do a whitespace change in foo_main.cc to force a compilation.
echo "" >> foo_main.cc
# This recreates the same foo_main.obj as before since the code didn't change.
compile foo_main.cc -> foo_main.obj
# This step could be saved by a content-addressed build system, see "Extension of the project" below.
link foo_main.obj -> bar_test.exe
# The binary didn't change, so it is not uploaded again, saving both I/O and latency.
isolate bar_test.exe -> isolateserver
# Swarming immediately returns the results of the last green run, saving both utilization and latency.
swarming.py run <sha1 of bar_test.isolated>
Extension of the project
Getting the build system to be content-addressed, as described in the Google reference above, is a secondary task: out of scope for this project but a natural extension. It saves on the build process itself at the cost of calculating a hash for each intermediate file, and it builds directly on the build being deterministic. It is not worth doing before the build is deterministic and that property is enforced; once it is, it will save significantly on build time on the build machines.
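A content-addressed build caches each action's outputs keyed by the digest of the command plus its inputs, so the link step in the workflow above could be skipped entirely. The sketch below is a hypothetical toy model of that idea, not an actual CAB implementation:

```python
import hashlib

# Cache of action outputs, keyed by digest of (command, inputs).
output_cache = {}

def action_key(command, input_files):
    """Digest the command line and every input's content.

    With a deterministic build, identical keys imply identical
    outputs, so cached outputs can be reused safely.
    """
    h = hashlib.sha256(' '.join(command).encode())
    for path, content in sorted(input_files.items()):
        h.update(path.encode())
        h.update(hashlib.sha256(content).digest())
    return h.hexdigest()

def execute(command, input_files, run):
    """Fetch the output from the cache, or run the action and store it."""
    key = action_key(command, input_files)
    if key in output_cache:
        return output_cache[key], True    # cache hit: action skipped
    output = run()
    output_cache[key] = output
    return output, False
```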