Design: Tracing tools

Objective

Automatically generate the exact list of test data files needed to run a test executable, on multiple OSes.

Background

The Chromium build is (almost) deterministic. It is declared in the GYP language, which lists all the source files and supports conditions to handle different OSes and different targets, such as builds with Aura or with Valgrind. As such, it is possible to build a single source file while knowing exactly which input files need to be present to compile it.

Tests have always assumed that they run directly in the directory they were built in, that is, that the whole source tree is available. This has a huge cost, as only a tiny fraction of these files are needed, and an even smaller fraction of the files that were built.

Generating the exact list of files needed to run a test by hand would require an enormous amount of manual work, so it must be automated and must work across multiple OSes.

Overview

  1. For each OS,
    1. An OS-specific tracing platform is chosen.
    2. Python glue code runs an executable under the system-provided tracer.
    3. Python glue code reads the trace file and generates an OS-independent data structure.
    4. An API is used to manipulate this data in an almost OS-independent way.
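
As an illustration of step 3 above, here is a minimal sketch of turning an OS-specific log into an OS-independent structure. It assumes a simplified strace-style log where each line is an optional pid followed by an open() or openat() call; the function name parse_open_calls and the output layout are hypothetical, and the real parsers handle many more system calls and corner cases.

```python
import re

# Matches simplified strace lines such as:
#   1234 open("/usr/lib/libfoo.so", O_RDONLY) = 3
# The optional leading number is the pid that strace prepends when following
# child processes.
_OPEN_RE = re.compile(
    r'^(?:(\d+)\s+)?open(?:at)?\((?:[A-Z_]+,\s*)?"([^"]+)".*\)\s*=\s*(-?\d+)')


def parse_open_calls(log_lines):
    """Returns {pid: sorted list of paths that were opened successfully}.

    Hypothetical helper: lines without a pid are filed under pid 0.
    """
    files = {}
    for line in log_lines:
        match = _OPEN_RE.match(line)
        if not match:
            continue
        pid, path, result = match.groups()
        if int(result) < 0:
            # The call failed, so the file was not actually read.
            continue
        files.setdefault(int(pid or 0), set()).add(path)
    return {pid: sorted(paths) for pid, paths in files.items()}
```

The returned dictionary no longer depends on the tracer that produced the log, which is the property the later steps rely on.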

Infrastructure

For now, no infrastructure is needed as this is all Python code.

Detailed Design


There are two important classes: ApiBase and Results.

ApiBase

ApiBase declares the common interface to trace a child process, independent of the implementation or the OS. It has a limited interface and separates the action of tracing from the action of analyzing the traces. ApiBase.get_tracer() returns an instance of a class derived from ApiBase.Tracer, which manages the lifetime of the tracing infrastructure. Once tracing is done, ApiBase.parse_log() generates a list of Results instances.
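
The separation between tracing and analyzing can be sketched as follows. Only the names ApiBase, ApiBase.Tracer, get_tracer() and parse_log() come from this design; the arguments, the trace() and close() methods, the context-manager behavior and the FakeApi class are assumptions made for illustration and may differ from the actual code.

```python
import abc


class ApiBase(abc.ABC):
    """Common interface, independent of the OS-specific tracer."""

    class Tracer(abc.ABC):
        """Owns the tracing infrastructure for as long as it is open."""

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()
            return False

        @abc.abstractmethod
        def trace(self, cmd, cwd):
            """Hypothetical: runs cmd under the tracer, returns its exit code."""

        @abc.abstractmethod
        def close(self):
            """Hypothetical: stops tracing and flushes the log."""

    @abc.abstractmethod
    def get_tracer(self, logname):
        """Returns an ApiBase.Tracer; the logname argument is an assumption."""

    @abc.abstractmethod
    def parse_log(self, logname):
        """Returns the list of results parsed from the log."""


class FakeApi(ApiBase):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self):
        self._logs = {}

    class Tracer(ApiBase.Tracer):
        def __init__(self, logs, logname):
            self._logs = logs
            self._logname = logname
            self.closed = False

        def trace(self, cmd, cwd):
            # A real tracer would start the child process here.
            self._logs.setdefault(self._logname, []).append((tuple(cmd), cwd))
            return 0

        def close(self):
            self.closed = True

    def get_tracer(self, logname):
        return self.Tracer(self._logs, logname)

    def parse_log(self, logname):
        return list(self._logs.get(logname, []))
```

With such an interface, a caller traces inside a `with api.get_tracer(...)` block and only calls parse_log() after the block exits, so the analysis never runs while the tracer is still alive.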

The implementation uses dtrace on OSX, the "NT Kernel Logger" on Windows and strace on Linux. Many of the particularities of each implementation are described in the docstring of each implementation class. To start with, strace is a pure user-mode tracer, while dtrace and the NT Kernel Logger are kernel-only tracers, which have no effect on the traced processes. The API hides this difference and presents a consistent interface across user-mode and kernel-mode tracers.
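
The per-OS choice of tracer can be pictured with the small sketch below. The mapping mirrors the paragraph above; the function name tracer_name_for and the use of sys.platform prefixes are assumptions, not necessarily how the actual code selects its implementation class.

```python
import sys

# Tracer named in this design for each OS family, keyed by sys.platform prefix.
_TRACERS = (
    ('win', 'NT Kernel Logger'),
    ('darwin', 'dtrace'),
    ('linux', 'strace'),
)


def tracer_name_for(platform=None):
    """Returns the name of the tracer used on the given platform.

    Hypothetical helper: defaults to the platform running this code.
    """
    platform = sys.platform if platform is None else platform
    for prefix, name in _TRACERS:
        if platform.startswith(prefix):
            return name
    raise NotImplementedError('No tracer known for %s' % platform)
```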

Results

Results holds the results of the trace. It is immutable, so any mutation through one of its member functions effectively creates a new instance. It contains a tree of Results.Process instances, representing the root process and each child process started by it. Each process has a .files attribute that lists the files accessed by this process. It also contains metadata such as the command line and the executable name.
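
A minimal sketch of such an immutable tree is shown below. Only the .files attribute and the existence of command line and executable metadata come from this design; the other field names and the all_files() and strip_root() methods are hypothetical and chosen only to show that a "mutation" returns a new instance.

```python
import collections

# Hypothetical field names, apart from 'files'.
Process = collections.namedtuple(
    'Process', ['pid', 'executable', 'command', 'files', 'children'])


class Results(object):
    """Immutable view over the process tree recorded by a trace."""

    def __init__(self, process):
        self._process = process

    @property
    def process(self):
        return self._process

    def all_files(self):
        """Returns the sorted list of files touched by the whole tree."""
        found = set()
        stack = [self._process]
        while stack:
            proc = stack.pop()
            found.update(proc.files)
            stack.extend(proc.children)
        return sorted(found)

    def strip_root(self, root):
        """Returns a new Results keeping only files under root, made relative.

        The current instance is left untouched.
        """
        prefix = root.rstrip('/') + '/'

        def strip(proc):
            return proc._replace(
                files=tuple(
                    f[len(prefix):] for f in proc.files
                    if f.startswith(prefix)),
                children=tuple(strip(c) for c in proc.children))

        return Results(strip(self._process))
```

Because every transformation builds a new tree, intermediate results can be shared or cached without one caller's filtering affecting another.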

Google-test enabled tracing

Chromium uses GTest to drive its unit tests. To improve efficiency, a trace can work on just a subset of the tests. The script trace_test_cases.py traces a gtest test executable, each test case individually. It reduces the wall-time duration taken to trace the tests by tracing multiple test cases simultaneously. The result is rich, detailed information about each test case.
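
Tracing each test case individually first requires enumerating the test cases. The sketch below shows one way to do it, based on the standard gtest flags --gtest_list_tests and --gtest_filter; the helper names are hypothetical and this is not necessarily how trace_test_cases.py is written.

```python
def parse_gtest_list(output):
    """Parses the text printed by --gtest_list_tests into 'Suite.Test' names."""
    names = []
    suite = None
    for line in output.splitlines():
        # Typed and parameterized tests append a '#' comment; drop it.
        line = line.split('#', 1)[0].rstrip()
        if not line:
            continue
        if line.startswith(' '):
            # Indented lines are test names belonging to the current suite.
            if suite:
                names.append(suite + line.strip())
        elif line.endswith('.'):
            # Unindented lines ending with a dot are suite names, e.g. 'Foo.'.
            suite = line
    return names


def commands_per_test_case(executable, names):
    """Returns one command line per test case, each running a single test."""
    return [[executable, '--gtest_filter=%s' % name] for name in names]
```

Each returned command line can then be run under the tracer on its own, which yields one trace per test case.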

Project information

Its source lives inside the Chromium tree at http://src.chromium.org/viewvc/chrome/trunk/tools/swarm_client alongside the Isolate scripts. It was mainly coded by maruel@ with the assistance of csharp@.

Caveats

  1. The implementation requires a lot of OS-specific work.
  2. Some tracers still have bugs. For example, the default strace version (4.5) included in Ubuntu has internal race conditions, so the trace logs can be corrupted even if the test case ran properly. When that happens, the trace needs to be run a second time.
  3. The tracing infrastructure causes an enormous performance hit on OSX, making it hardly usable.
  4. The tracing infrastructure on Windows is not 100% accurate. It occasionally lists files that were not touched by the executable, and the tracer itself does not log enough to output as much data as dtrace or strace.
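
The second caveat implies a retry step around tracing. A minimal sketch follows, assuming the parser signals a corrupted log with an exception; the TraceCorrupted class and the trace_with_retry helper are hypothetical names introduced for this example.

```python
class TraceCorrupted(Exception):
    """Hypothetical error raised when a trace log cannot be parsed."""


def trace_with_retry(run_trace, parse_log, attempts=2):
    """Runs the trace and parses it, tracing again if the log is corrupted.

    run_trace and parse_log are callables taking no argument. The last parsing
    error is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for _ in range(attempts):
        run_trace()
        try:
            return parse_log()
        except TraceCorrupted as e:
            last_error = e
    raise last_error
```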

Latency

Wall-time duration is reduced by tracing multiple test cases concurrently. This helps significantly on systems with 8 or more cores. No distributed tracing infrastructure is planned yet.
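
Concurrent tracing on a single machine can be sketched with a thread pool, as below. The function name trace_all and its arguments are assumptions; trace_one stands for whatever callable traces a single test case. Threads are sufficient in this sketch because each job is expected to spend its time waiting on a traced child process.

```python
import concurrent.futures
import os


def trace_all(names, trace_one, jobs=None):
    """Runs trace_one(name) for every name, at most jobs at a time.

    Returns {name: result}. jobs defaults to the number of CPU cores.
    """
    names = list(names)
    jobs = jobs or os.cpu_count() or 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map() preserves the input order, so zip() pairs correctly.
        return dict(zip(names, pool.map(trace_one, names)))
```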

Scalability

Only system-local scalability is considered, as no unattended "Tracing a unit test on Swarm" has been implemented yet. Tracing on Swarm probably won't be implemented, since the Swarm bots would have to download the full Chromium checkout. The scalability effort is to make sure all the CPU cores are used by tracing multiple test cases simultaneously, reducing the wall-time needed to get the complete results for a unit test executable.

Redundancy and Reliability

N/A

Security Consideration

N/A as the tool requires root access on OSX and Windows.

Testing plan

The testing is in two parts:
  1. Testing the code itself so its behavior corresponds to the specified design.
  2. Testing that the OS-specific tracing infrastructure works as intended. For example, an OS update, like OSX 10.6 to 10.7, caused different functions to be called, which required changes to the D code used with dtrace.
So the automated smoke test must be run on each of the supported platforms. This is currently done manually due to the limited number of contributors.