Enables running tests outside their checkout by describing each unit test's dependencies efficiently. The dependency list is generated using the Tracing infrastructure.
The Chromium waterfall currently uses a set of "builders" and "testers". The "testers" need a full checkout to run the tests since we currently have no idea which files are read by which tests, or when they are accessed at all. In addition, we create a zip with all the tests for distribution, which is grossly inefficient, and we run into size limits as the archive becomes larger than 2 GB.
For the Chromium teams to scale properly, a more deterministic approach needs to be applied to the way tests are run. Guessing what is needed to run a test and manually keeping a list of executables to zip in a Python script to send over to testers is not scalable. So a totally different approach is to make the tests run in a "pure" environment, making sure that running the tests is independent of the surrounding environment. From a directory-tree point of view, the best way to do this is to map the test executable into a temporary directory before running it. This is what this project is about.
The goal is to not have to sync the sources on testers to run a test. Syncing the sources is a constant cost that doesn't scale well. As we want to spread the test execution over more slaves, we need to reduce the constant costs since they are becoming the major cost of running a test on a slave!
Listing the dependencies needed to run a "step", in this case a test, enables the use of a forward-pushing file system rather than a faulting file system; that is, the test runs with only its identified dependencies mapped in the directory tree.
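As a concrete illustration of the forward-pushing approach, here is a minimal sketch, not the project's actual tooling, of mapping a list of known dependencies into a temporary directory and running the test from there; the helper name and the use of hard links are assumptions made for the illustration:

    import os
    import shutil
    import subprocess
    import tempfile

    def run_in_temp_tree(checkout_root, dependencies, command):
        """Maps |dependencies| (paths relative to |checkout_root|) into a fresh
        temporary directory and runs |command| from there, so the test only
        sees the files that were explicitly listed."""
        tmp = tempfile.mkdtemp(prefix='isolate_')
        try:
            for rel_path in dependencies:
                src = os.path.join(checkout_root, rel_path)
                dst = os.path.join(tmp, rel_path)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                try:
                    # Hard links avoid copying bytes when both trees share a volume.
                    os.link(src, dst)
                except OSError:
                    shutil.copy2(src, dst)
            return subprocess.call(command, cwd=tmp)
        finally:
            shutil.rmtree(tmp, ignore_errors=True)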
We highly recommend reading the engtools blog posts about building in the cloud. In particular, read "Testing at the speed and scale of Google" and "Build in the Cloud: Distributing Build Steps" for background information about why you would want to do this. The whole blog is a good read.
The whole project is written in Python.
You can find an overview presentation slides at https://docs.google.com/a/google.com/presentation/d/18DS0Za8s9O9hCei2I2KTHUPXFV39HfAe5hbZRiUzNG8/view. Sorry, Googlers-only.
All test distribution logic is inside the Swarm integration project.
In the current buildbot infrastructure, the special configuration "
See the general page for general information.
Here's a brief overview of the most important tools:
The .isolate file format is a very strict subset of the .gyp file format. The format is described by its parser in isolate.py named
The .isolate file format is essentially a .gypi that is imported in a foo_test_run target to get the list of dependencies to track and is also used as an input file to isolate.py. Its format is designed to be:
Two things I made explicit non-goals are:
In short, anything that couldn't be generated by the tools automatically. Overall, I want to avoid, to a certain extent, the organic growth that is happening in the gyp files.
Here's a commented snapshot of src/base/base_unittests.isolate:
# This is the only form of supported condition.
# As an example, these are linux-specific files.
Currently (but bound to change) the only condition accepted is
The support for directory entries, that is, an entry that ends with /, is extremely important. For example, on OS X it is impossible to know in advance the names of all the build outputs that will be required, since the OS X bundles embed the Chromium version, like 23.0.1262.0, and this information is not accessible from inside the
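To make the structure concrete, here is a minimal hand-written sketch of what such a file can look like, showing both a condition and a directory entry; it is not the actual src/base/base_unittests.isolate, and the variable names (isolate_dependency_tracked, isolate_dependency_untracked, PRODUCT_DIR) are used here purely as an illustration of the GYP-like shape:

    # Illustrative sketch only, not the real src/base/base_unittests.isolate.
    {
      'conditions': [
        # A single OS condition selecting platform-specific dependencies.
        ['OS=="linux"', {
          'variables': {
            'isolate_dependency_tracked': [
              # An individual build output.
              '<(PRODUCT_DIR)/base_unittests',
            ],
            'isolate_dependency_untracked': [
              # A directory entry: everything under it is included.
              'data/file_util_unittest/',
            ],
          },
        }],
      ],
    }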
Symlinks are transparently supported. If a symlink is listed as a dependency, it is saved as a special entry with no SHA-1 associated with it, only the symlink target.
This file is stored on the content addressed storage with all the other files and
The reason is to leave a lot of room to the tool generating the
That's the primary reason for the design decision to not use tree-based storage for the file mapping (like git tree objects) but a flat list (more like <some hash-table based file system> in a certain way). The reason is that splitting the high-churn files from the low-churn files is not necessarily directly describable in terms of directories, so that's why the
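A sketch of what the resulting flat mapping can look like, covering both a regular file and the symlink entries mentioned above; the exact key names used here ('files', 'h' for hash, 's' for size, 'm' for mode, 'l' for symlink target) are assumptions made for the illustration:

    {
      "files": {
        "out/Release/base_unittests": {"h": "<sha-1 of the file content>", "s": 3829412, "m": 493},
        "out/Release/linked_data": {"l": "real_data"}
      }
    }

Because the mapping is keyed by flat paths, high-churn and low-churn files can be grouped independently of where they sit in the directory tree.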
It's important to clarify here that
Isolate server is a content-addressed cache that stores build results for a limited number of days. It natively supports precompressed data and tiered caching. It is a cache, not a permanent data store, so while the semantics can be similar to a content-addressed datastore, it is not one.
Each item's timestamp is refreshed on storage or request for presence. Fetching, like running
The cache uses a global 7-day eviction policy, so objects are deleted automatically if their presence is not tested within that window.
To help future-proof the server, all the objects are stored in a specified namespace. The namespace is used as a signal to specify which hashing algorithm is used (defaults to SHA-1) and whether the objects are stored in their plain format or transformed (compressed). The namespace logic also has a special case for temporary objects: any "temporary" namespace is evicted after a single day instead of the default 7 days.
It is interesting to compare the choice of embedding the hashing algorithm in the namespace instead of in each key, as camlistore does. It slightly reduces the string overhead and simplifies sending the hashes as binary bytes. A single request handling several items doesn't have to switch hashing algorithms per item. It is a requirement, and implicitly enforced, that a single .isolated has all its items referenced in the same namespace.
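A minimal sketch of how a client or the server could derive storage properties from the namespace string alone under this scheme; the concrete naming conventions assumed here (a '-gzip' suffix for compressed namespaces, a 'temporary' prefix for short-lived ones) are illustrative, not the server's actual naming rules:

    import hashlib

    def namespace_properties(namespace):
        """Derives storage properties from the namespace name alone."""
        # Hashing algorithm: SHA-1 by default.
        algo = hashlib.sha1
        # Compressed namespaces store and transfer deflated payloads.
        is_compressed = namespace.endswith('-gzip')
        # "temporary" namespaces are evicted after 1 day instead of 7.
        expiration_days = 1 if namespace.startswith('temporary') else 7
        return algo, is_compressed, expiration_days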
Some files are more important than others. In particular,
To optimize small object retrieval, small objects (with a current cut-off at 20 KB; heuristics need to be applied to select a better value) are stored directly inline in the datastore instead of the AppEngine blobstore, to reduce inefficient I/O for small objects.
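A sketch of that storage decision; the constant and function name are hypothetical:

    # Objects below the cut-off are stored inline in the datastore entity;
    # larger ones go to the blobstore.
    MIN_SIZE_FOR_BLOBSTORE = 20 * 1024  # current cut-off, subject to tuning

    def storage_backend(content):
        return 'datastore' if len(content) < MIN_SIZE_FOR_BLOBSTORE else 'blobstore'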
Like most SCMs such as git and hg, but unlike most CAS systems, isolateserver supports on-the-wire and in-storage compression while using the uncompressed data to calculate the hash key. Unlike git, isolateserver doesn't recompress on the fly and doesn't do inter-file compression.
The reason for the on-the-wire compressed transfer is to greatly reduce network I/O. It is based on the assumption that most objects are build outputs, usually executables, so they are usually both large and highly compressible. It is important that the .isolated files do not need to be modified to switch from the non-compressed namespace to a compressed one, so the key is the same for the compressed and uncompressed versions but they are stored in different namespaces.
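A sketch of the "hash the plain content, ship compressed bytes" rule, which is what keeps the key identical between the compressed and uncompressed namespaces; the helper is hypothetical:

    import hashlib
    import zlib

    def prepare_for_upload(content, compressed_namespace):
        # The key is always derived from the uncompressed content...
        key = hashlib.sha1(content).hexdigest()
        # ...while the payload stored and sent over the wire may be deflated.
        payload = zlib.compress(content, 7) if compressed_namespace else content
        return key, payload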
The server is optimized for warm cache usage; the most frequent use case is that a large number of files are already in the cache on a store operation. The way to do this is to batch presence requests at 1000 items per HTTP request, greatly reducing the network overhead and latency. Then for each cache miss, that is, each '\0' byte sent for the corresponding index in the hashes payload, the item is uploaded as a separate HTTP POST.
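A sketch of that warm-cache flow, batching presence checks and only uploading the misses; the URL paths and exact wire format used here are placeholders, not the server's actual API:

    import urllib.request

    BATCH = 1000  # hashes per presence-check request

    def upload_missing(base_url, namespace, items):
        """items: dict mapping hex digest -> file content."""
        digests = sorted(items)
        for i in range(0, len(digests), BATCH):
            chunk = digests[i:i + BATCH]
            # One POST asks about up to 1000 hashes at once.
            req = urllib.request.Request(
                '%s/contains/%s' % (base_url, namespace),
                data=''.join(chunk).encode())
            presence = urllib.request.urlopen(req).read()
            # Each '\0' byte in the reply marks a cache miss at that index.
            for flag, digest in zip(presence, chunk):
                if flag == 0:
                    urllib.request.urlopen(urllib.request.Request(
                        '%s/store/%s/%s' % (base_url, namespace, digest),
                        data=items[digest]))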
The set of supported requests is deliberately limited to the specific intended use case:
It's interesting to look at the trade-offs compared with a few content-addressed storage systems. Note that the other CAS systems compared here are not caches but real datastores, but the comparison is still useful from an optimization standpoint. The comparison uses git (a source control system), bup (backup software based on git), and camlistore (a one-size-fits-all datastore).
Efforts have been made to look at using a faulting file system instead, e.g. doing a copy-on-write compile plus mounting the partition on the testers to run the test from the checkout. The problem is that this puts a significant burden on the hardware providing these partitions, and the round-trip latency is worsened since the infrastructure has no idea what data will be needed upfront.
Keeping the test dependency list in
The latency of each step is optimized;
The latency is reduced by improving Chromium's infrastructure scalability over more testers as VMs. To achieve better scalability, this project makes it possible to confine each test to a limited view of the available files. The bottleneck will become:
There is currently no redundancy for the buildbot infrastructure; if a VM dies, it is simply replaced right away by a sysadmin. The hashtable datastore isn't redundant or reliable, but it is also discardable data: it can be rebuilt from sources if needed. If it fails, it will block the infrastructure, but it is possible to switch from AppEngine back to NFS/Samba with some code changes (NFS was used before but that code was removed).
The hashtable itself is going to require a valid GAIA account. The NFS/Samba datastore itself is not accessible outside of the DMZ but should be considered low security.
The test isolation code is unit and smoke tested. Since most of the isolate code is OS-independent, testing is relatively easy. Only the hardlink and symlink support needs OS-specific code.