Objective

Enables running tests outside their checkout by describing each unit test's dependencies efficiently. The dependency list is generated using the Tracing infrastructure. This is done to enable the Swarm integration project. It depends on the Tracing tools.

Background

The Chromium waterfall currently uses a set of "builders" and "testers". The "testers" need a full checkout to run the tests, since we currently have no idea which files are read by which tests, if they are accessed at all. In addition, we create a zip with all the tests for distribution, which is grossly inefficient, and we run into size limits as the archive grows larger than 2 GB.

For the Chromium teams to scale properly, a more deterministic approach needs to be applied to the way tests are run. Guessing what is needed to run a test, and manually keeping a list of executables to zip in a python script to send them over to testers, is not scalable.

So a totally different approach is to make the tests run in a "pure" environment, so that running a test is idempotent with respect to the surrounding environment. From a directory-tree point of view, the best way to do that is to map the test executable into a temporary directory before running it. This is what this project is about.

Overview

The goal is to not have to sync the sources on testers to run a test. Syncing the sources is a constant cost that doesn't scale well. As we want to spread the test execution over more slaves, we need to reduce the constant costs, since they are becoming the major cost of running a test on a slave!

Listing the dependencies needed to run a "step", in this case a test, enables the use of a forward-pushing file system rather than a faulting file system: the test runs with only its identified dependencies mapped in the directory tree.

We highly recommend reading the engtools blog posts about building in the cloud.
In particular, read "Testing at the speed and scale of Google" and "Build in the Cloud: Distributing Build Steps" for background information about why you would want to do that. The whole blog is a good read, though.

Components include:
The whole project is written in python. An overview slide deck is available at https://docs.google.com/a/google.com/presentation/d/18DS0Za8s9O9hCei2I2KTHUPXFV39HfAe5hbZRiUzNG8/view (sorry, Googlers-only).

Infrastructure

isolate.py only interacts with a content-addressed file datastore hosted on AppEngine. It can also archive to an NFS/Samba share or locally. All test distribution logic is inside the Swarm integration project.

In the current buildbot infrastructure, the special configuration "GYP_DEFINES=test_isolation_mode=hashtable" makes it archive to a hashtable. So enabling the test isolation infrastructure inside the current Chromium Continuous Integration infrastructure is really a matter of flipping a switch.

Detailed design

See the general page for general information. Here's a brief overview of the most important tools:
Detailed design - .isolate file format

The .isolate file format is a very strict subset of the .gyp file format. The format is defined by its parser in isolate.py, load_isolate_as_config().

Goals and non-goals

The .isolate file format is essentially a .gypi that is imported in a foo_test_run target to get the list of dependencies to track, and that is also used as an input file to isolate.py. Its format is designed to be:
Two things I made explicit non-goals are:
In short, anything that couldn't be generated by the tools automatically. Overall, I want to avoid, to a certain extent, the organic growth that is happening in the gyp files.

Example

Here's a commented snapshot of src/base/base_unittests.isolate:

# Copyright (c) 2012 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

# The file is a stripped down dialect of python so comments start with '#'.
{
  # The global variables entry is the one defining OS-independent settings.
  'variables': {
    # These are the files that should be tracked by the build tool
    # (ninja/make/msbuild/etc). They will be listed as a dependency for this
    # step.
    'isolate_dependency_tracked': [
      # Paths are based at the .isolate's location. They are specified with '/'
      # for consistency but isolate.py replaces them with '\' on Windows.
      '../testing/test_env.py',
      # <(PRODUCT_DIR) is the variable PRODUCT_DIR and this variable means that
      # the file is located in the output directory.
      # <(EXECUTABLE_SUFFIX) is ".exe" on Windows and "" everywhere else.
      # As an example, the base_unittests result for linux release would be
      # "../out/Release/base_unittests" and on Windows debug
      # "../build/Debug/base_unittests.exe". Later the paths are 'fixed' on
      # Windows to use '\' instead of '/' but that occurs after the variable
      # replacement stage.
      '<(PRODUCT_DIR)/base_unittests<(EXECUTABLE_SUFFIX)',
      'data/json/bom_feff.json',
    ],
    # These are the files that should NOT be tracked by the build tool. Reasons
    # include:
    # - The file path contains a space in it.
    # - The file may not always be checked out, like for non-public test data
    #   files.
    # - The entry represents a directory. This is tagged with a trailing '/'.
    #   This means that all the files in this directory and its subdirectories
    #   are going to be mapped to run the test. This greatly simplifies
    #   maintenance and reduces the number of files to be listed by an order
    #   of magnitude.
    # These entries are generated automatically by the isolate.py tool when it
    # is possible.
    'isolate_dependency_untracked': [
      'data/file_util_unittest/',
    ],
    # These are files that are just touched by the tests. That means that we
    # need the file to be present when running the test, but it doesn't need
    # to contain the real contents.
    'isolate_dependency_touched': [
      'data/touched_file.txt',
    ],
  },
  # This is the start of OS-dependent settings.
  'conditions': [
    # This is the only form of supported condition.
    ['OS=="linux"', {
      # The inner block of variables will be merged to the global block.
      'variables': {
        # This is the command to run. It is not supported to have a global
        # 'command' entry that is overridden by a condition entry. So either
        # there's a global 'command' variable or one 'command' variable for
        # each OS.
        'command': [
          '../testing/xvfb.py',
          '<(PRODUCT_DIR)',
          '<(PRODUCT_DIR)/base_unittests<(EXECUTABLE_SUFFIX)',
        ],
        # As an example, these are linux-specific files.
        'isolate_dependency_tracked': [
          '../testing/xvfb.py',
          '<(PRODUCT_DIR)/xdisplaycheck<(EXECUTABLE_SUFFIX)',
        ],
      },
    # This is using the 'else' clause in gyp to do the equivalent of
    # 'OS!="linux"'.
    }, {
      'variables': {
        'command': [
          '../testing/test_env.py',
          '<(PRODUCT_DIR)/base_unittests<(EXECUTABLE_SUFFIX)',
        ],
      },
    }],
    ['OS=="win"', {
      'variables': {
        'isolate_dependency_tracked': [
          '<(PRODUCT_DIR)/icudt.dll',
          'data/file_version_info_unittest/FileVersionInfoTest1.dll',
          'data/file_version_info_unittest/FileVersionInfoTest2.dll',
        ],
      },
    }],
  ],
}

Conditions must be of the form 'OS=="<os>"' and nothing else. The available <(FOO) variables depend on the foo_unittests_run target, which defines them with the --variables flag.

Directory entries

The support for directory entries, that is, entries that end with /, is extremely important.
For example, on OSX it is impossible to know in advance the names of all the build outputs that will be required, since the OSX bundles embed the Chromium version, like 23.0.1262.0, and this information is not accessible from inside the .gyp file.

Symlinks are transparently supported. If a symlink is listed as a dependency, it is saved as a special file with no SHA-1 associated with it, only the symlink pointer.

Detailed design - .isolated file format

A .isolated file is generated by isolate.py when run in one of the modes run, trace, check, or hashtable. It lists the exact content of each file that needs to be mapped at its relative path, the content being addressed by its SHA-1, and the command that should be used.

This file is stored on the content-addressed storage with all the other files, and run_isolated.py reads it to remap the expected directory tree. The exact format is prescribed by load_manifest() in run_isolated.py. It consists of a json file with the primary keys:
Arbitrary split vs tree of trees

The .isolated format supports the includes key to split a list of files across separate .isolated files and merge it back. This is in stark contrast with a more traditional tree-of-trees structure, like the git tree object rooted at a git commit object.

The reason is to leave a lot of room for the tool generating the .isolated files to package low-churn files separately from high-churn files. As a practical example, test data files are low-churn: the odds of these files being modified are in the range of a few times a day at most. The files in <(PRODUCT_DIR) are high-churn since they are usually different at each build. It's up to the tool generating the .isolated files, in this case isolate.py, to generate the optimal setup.

That's the primary reason for the design decision not to use tree-based storage for the file mapping (like the git tree objects) but a flat list (more like <some hash-table based file system> in a certain way). Splitting the high-churn files from the low-churn files is not necessarily directly describable in terms of directories, which is why the .isolated format supports includes entries that permit a .isolated file listing the high-churn files to include a .isolated file listing the low-churn files.

It's important to clarify here that includes in .isolate files are not related in any way to includes in .isolated files. (Reread the sentence if necessary.) The first reuses a common list of runtime dependencies for multiple targets; the second optimizes the overall size of the .isolated files to archive and to load on slaves.

Detailed design - isolateserver

Isolate server is a content-addressed cache that stores build results for a limited number of days. It natively supports precompressed data and tiered caching. It is a cache, not a permanent data store, so while its semantics can be similar to those of a content-addressed datastore, it is not one.
Object eviction (GC)

Each item's timestamp is refreshed on storage or on a request for presence. Fetching, like running run_isolated.py, never updates the timestamp. Only storing does, like running isolate.py hashtable. An object's timestamp is updated even if the object was already stored, as a side effect of /content/contains/.

The cache uses a global 7-day eviction policy, so objects are deleted automatically if they are not tested for presence.

Namespaces

To help future-proof the server, all the objects are stored in a specified namespace. The namespace is used as a signal to specify which hashing algorithm is used (defaults to SHA-1) and whether the objects are stored in their plain format or transformed (compressed). The namespace logic also has a special case for temporary objects: any "temporary" namespace is evicted after a single day instead of the default 7 days.

It is interesting to compare the choice of embedding the hashing algorithm in the namespace with embedding it in each key, as camlistore does. It slightly reduces the string overhead and simplifies sending the hashes as binary bytes. A single request handling several items doesn't have to switch hashing algorithms per item. It is a requirement, and is implicitly enforced, that a single .isolated file has all its items referenced in the same namespace.

Priorities

Some files are more important than others. In particular, .isolated files must have much lower fetch latency than the other ones since they are the bottleneck to fetching more data, i.e. all the dependencies. These high-priority files are stored in memcache in addition to the datastore, so the retrieve operation can complete with a lower latency.

Object sizes

To optimize small object retrieval, small objects (with a current cut-off at 20 KB; a better value still needs to be determined empirically) are stored directly inline in the datastore instead of the AppEngine blobstore, to reduce inefficient I/O for small objects.
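The eviction, namespace, and object-size rules above can be modeled with a small in-memory sketch. This is not the server code: the class name CacheModel, the tier labels, the way a "temporary" namespace is recognized by name, and the handling of an object of exactly 20 KB are all assumptions made for illustration.

```python
import time

DEFAULT_TTL = 7 * 24 * 3600   # global 7-day eviction policy
TEMPORARY_TTL = 24 * 3600     # "temporary" namespaces are evicted after one day
INLINE_CUTOFF = 20 * 1024     # current cut-off for inline storage


class CacheModel:
    """Toy model of the timestamp and size rules (hypothetical, in-memory)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        # (namespace, key) -> [timestamp, tier, content]
        self._entries = {}

    @staticmethod
    def ttl_for(namespace):
        # Assumption: a temporary namespace is recognized by its name.
        return TEMPORARY_TTL if "temporary" in namespace else DEFAULT_TTL

    def store(self, namespace, key, content):
        # Assumption: objects strictly below the cut-off are stored inline.
        tier = "inline" if len(content) < INLINE_CUTOFF else "blobstore"
        self._entries[(namespace, key)] = [self._clock(), tier, content]
        return tier

    def contains(self, namespace, key):
        entry = self._entries.get((namespace, key))
        if entry is None:
            return False
        entry[0] = self._clock()  # a presence check refreshes the timestamp
        return True

    def fetch(self, namespace, key):
        # Fetching never refreshes the timestamp.
        entry = self._entries.get((namespace, key))
        return None if entry is None else entry[2]

    def evict(self):
        """Deletes entries older than their namespace's TTL; returns the count."""
        now = self._clock()
        stale = [k for k, e in self._entries.items()
                 if now - e[0] > self.ttl_for(k[0])]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Passing a fake clock to the constructor makes the model easy to exercise: an object that is only fetched ages out after 7 days, while one that is tested for presence in the meantime survives.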
Explicit compression support

Like most SCMs such as git and hg, but unlike most CAS systems, isolateserver supports on-the-wire and in-storage compression while using the uncompressed data to calculate the hash key. Unlike git, isolateserver does not recompress on the fly and does not do inter-file compression.

The reason for the on-the-wire compressed transfer is to greatly reduce the network I/O. It is based on the assumption that most objects are build outputs, usually executables, so they are usually both large and highly compressible.

It is important that the .isolated files do not need to be modified to switch from the non-compressed namespace to a compressed one, so the key is the same for the compressed and uncompressed versions, but they are stored in different namespaces.

Optimized for warm store, warm fetch

The server is optimized for warm cache usage; the most frequent use case is that a large number of files are already in the cache on a store operation. The way to do this is to batch requests for presence at 1000 items per HTTP request, greatly reducing the network overhead and latency. Then, for each cache miss, that is, each index in the hashes payload for which a '\0' byte is sent back, the item is uploaded as a separate HTTP POST.

URL endpoints

The number of supported requests is designed to be limited for its specific intended use case:
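As an illustration of the compression and warm-store behavior described earlier in this section (not of the endpoint list), here is a minimal client-side sketch. FakeServer, upload(), the namespace name, the use of zlib, and the non-zero byte used to mark a hit are all assumptions; the text only specifies that keys are computed on uncompressed data, that presence checks are batched at 1000 items, and that a '\0' byte marks a miss.

```python
import hashlib
import zlib

BATCH_SIZE = 1000  # presence checks are batched at 1000 items per request


def content_key(data):
    """The key is the SHA-1 of the uncompressed content."""
    return hashlib.sha1(data).digest()


class FakeServer:
    """Hypothetical in-memory stand-in for the server."""

    def __init__(self):
        self.objects = {}
        self.contains_requests = 0

    def contains(self, namespace, keys):
        self.contains_requests += 1
        # One byte per key: b'\0' marks a miss. Using b'\1' for a hit is an
        # assumption; the text only specifies the miss marker.
        return b"".join(
            b"\x01" if (namespace, key) in self.objects else b"\x00"
            for key in keys)

    def store(self, namespace, key, payload):
        self.objects[(namespace, key)] = payload


def upload(server, namespace, blobs):
    """Uploads only the blobs reported missing; returns how many were sent."""
    items = {content_key(blob): blob for blob in blobs}  # drops duplicates
    keys = list(items)
    uploaded = 0
    for start in range(0, len(keys), BATCH_SIZE):
        batch = keys[start:start + BATCH_SIZE]
        reply = server.contains(namespace, batch)
        for key, flag in zip(batch, reply):
            if flag == 0:
                # Stored compressed, but keyed by the uncompressed hash.
                server.store(namespace, key, zlib.compress(items[key]))
                uploaded += 1
    return uploaded
```

On a warm cache, a second upload() of the same blobs costs only the batched presence checks, which is the case the server is said to be optimized for.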
Comparison to a few off-the-shelf CAD/CAS solutions

It's interesting to look at the trade-offs with a few content-addressed storage systems. Note that the other CAS systems compared here are not caches but real datastores, but the comparison is still useful from an optimization standpoint. The comparison uses git (a source control system), bup (backup software based on git), and camlistore (a one-size-fits-all datastore).
Future work
Project information
Caveats

Using a faulting file system instead has been looked at, e.g. a copy-on-write compile plus mounting the partition on the testers to run the test from the checkout. The problem is that it puts a significant burden on the hardware providing these partitions, and the round-trip latency is worse, since the infrastructure has no idea upfront what data will be needed.

Keeping the test dependency list in .isolate files free of stale entries will be tricky. This would require occasional automated tracing of the test to figure out whether a file is no longer accessed and to trigger a warning accordingly. Since this tracing significantly slows down the test, especially on Windows, it can't be kept enabled for all test shard executions.

Latency

The latency of each step is optimized:
Scalability

The latency is reduced by improving the scalability of the Chromium infrastructure over more testers running as VMs. To achieve better scalability, this project makes it possible to confine each test to a limited view of the available files. The bottleneck will become:
Redundancy and Reliability

There is currently no redundancy for the buildbot infrastructure; if a VM dies, it is simply replaced right away by a sysadmin. The hashtable datastore isn't redundant or reliable, but its data is also discardable: it can be rebuilt from sources if needed. If it fails, it will block the infrastructure, but it is possible to switch from AppEngine back to NFS/Samba with some code changes (NFS was used before but that code was removed).

Security Considerations

The hashtable itself is going to require a valid GAIA account. The NFS/Samba datastore itself is not accessible outside of the DMZ but is to be considered of low security.

Testing Plan

The test isolation code is unit and smoke tested. Since most of the isolate code is OS-independent, testing is relatively easy. Only the hardlink and symlink support needs OS-specific code.
