
Reliability Tests

Note: As of 1/24/2012, the reliability bots have provided no real results for 6-12 months.  Even green builds returned warning codes indicating that the tests had no access to data.  This is a known issue while a replacement system is being constructed.  Please ignore any purple/offline bots for now.


Understanding reliability test results


The Chromium Buildbot runs a large-scale set of tests called the distributed reliability tests.  This system automatically tests new builds against thousands of pages from the web.  It also fuzz-tests the user interface against thousands of possible user action sequences.  The goal of this testing is to catch crashes early, in particular crashes that users are likely to see.

The results for each run are available on the Buildbot; see the Win Reliability builder.  Because each run takes 30-45 minutes, several changes are often batched together in the same run.  To look at the list of changes that went into a particular run, click on the "stdio" link of a reliability test run and find something like the following near the top:
Will stage build r15671 and get the complete result of previous build r15668.
or
Will get the partial result of current build r15668.
In either case, it tells you the corresponding svn revision of the build against which the test was run.

Similarly, look for the previous successful reliability result and find its revision.  The two revisions together give you the range of changes that turned the test red.  One way of quickly examining the list of changes that went in between two revisions is to use the following URL template:
http://build.chromium.org/buildbot/perf/dashboard/ui/changelog.html?url=/trunk/src&mode=html&range=SUCCESS_REV:FAILURE_REV
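
As a small convenience, the template can be filled in programmatically.  This is only a sketch: the helper name `changelog_url` is hypothetical, and the check that the success revision precedes the failure revision is an added assumption, not something the dashboard is known to require.

```python
# URL template copied from the text above, with the two placeholders
# turned into Python format fields.
CHANGELOG_TEMPLATE = (
    "http://build.chromium.org/buildbot/perf/dashboard/ui/changelog.html"
    "?url=/trunk/src&mode=html&range={success}:{failure}"
)


def changelog_url(success_rev, failure_rev):
    """Build the changelog URL for the range SUCCESS_REV:FAILURE_REV."""
    if success_rev >= failure_rev:
        # Assumption: the last green revision is older than the red one.
        raise ValueError("success_rev must be lower than failure_rev")
    return CHANGELOG_TEMPLATE.format(success=success_rev, failure=failure_rev)
```

For example, `changelog_url(15668, 15671)` yields the template with `range=15668:15671`.
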
The reliability tests use a set of criteria to determine whether or not the build should fail (i.e. turn red).  "New" crashes, that is, crashes that are not already marked as known, will trigger a failure.  The build will also fail if the overall crash rate is too high.

When a new crash occurs during a test, its crash dump is analyzed and a stack trace is produced for the functions that were on the stack at the point of the crash.  This stack trace is compared against a list of known crashes.  If none of the crashes in the known crashes list match the crash, the crash is considered new and the build will fail.

This system is not perfect.  Crash stacks for the same crash may change or have minor variants, causing them not to match any known crash and be incorrectly reported as new.  In other words, false positives are possible.  As described below, properly constructing a crash signature pattern of the crash is the key to reducing the number of false positives.  Over time, the system has proven to work well despite occasional false positives. 

Okay, so what do I do if the build turns red?

Most importantly, don't ignore it.  Please fix it ASAP following the instructions in this section.  Contact ace@chromium.org for more questions.

A red build may be the result of several factors.  It could mean that a new crash has been introduced, it could mean that there are too many crashes for the product to be considered stable, or it could be a false positive.  To help you identify what needs to be done in each case, use the following flow chart.  More details on each step are explained below.



The numbered steps below correspond to the steps in the above flow chart.

  1. First, identify what the cause of the build failure was.  This will be listed in the reliability test output, which you can find by clicking on the "stdio" link of the reliability test run.  If the build failed because of an unknown crash, the output will say "REGRESSION: NEW crash stack traces found".  If the build failed because there were too many crashes, the output will say "REGRESSION: Significant increase on crash rate".

  2. If the failure was due to a significant crash rate increase, it's a good idea to close the tree and revert the changes that likely caused the issue.  Follow the link near the top of the test output to see the list of changes in this run (see previous section for more info).

  3. If the failure was due to a new crash, first check whether or not the crash is a variant of one of the current known crashes.  The known crashes list is located at:

    src/chrome/test/data/reliability/known_crashes.txt

    Open this file in your favorite text editor.  This file contains crash signature patterns for known crashes, based on crash signatures.  Some definitions before proceeding:

    • Crash stack: the call stack of all functions when the crash happened.
    • Crash signature: a concatenation of all functions in chrome.dll on the crash stack.
    • Crash signature pattern: a pattern intended to match a specific crash signature or a certain class of crash signatures.  A pattern is intended to be a generalized, yet still identifying, form of a particular crash.  It can be a prefix of a crash signature, a substring, or something more complex such as a regular expression.

    For more information, see the documentation at the top of known_crashes.txt.

    The crash signature for the new crash will be listed on the reliability test output.  A new crash will look like the following:

    INFO: NEW stack trace signature found:
    file_util::writefile___`anonymous namespace'::savelatertask::run___messageloop::runtask___messageloop::dowork...

    Check if there are any crash signature patterns in the known crashes list that look very similar to the signature of the new crash.  If there are, the current signature pattern in the known crashes list may not be general enough.  You may want to make it more general (for example, by removing some functions from it) or you may need to add another pattern to catch this variant of the crash.

  4. If the new crash does not appear to be in the known crashes list, see if there is already a bug on file in the Chromium issue tracker.  Searching for the top function on the stack usually works fairly well.  Try searching all issues, instead of just open ones, to see if a bug was already filed and closed (for example as WontFix).  Sometimes bugs will be closed due to a lack of information/reproducibility.

  5. Take a look at the severity of the crash, the number of occurrences, and the changes that may have caused the crash.  Use your best judgment to determine whether or not reverting changes is necessary.

    For example, we typically don't revert WebKit merges on account of new crashes.  Instead, we usually file bugs and wait for the fixes to go in upstream.

  6. If you decide not to revert any changes, please file a bug.  In the bug, mention the revision range in which the crash started occurring and, if possible, the most likely change(s) in that range that may have caused the crash.  Also include the entire crash stack, as shown on the reliability test output, in the bug report.  Please CC huanr@chromium.org on the bug.

  7. Make sure the known crashes list is updated appropriately.

    When adding or updating the known crashes list, try to pick a good signature pattern for the crash.  Usually a prefix pattern with the first 3-5 functions works well.  That is, the top several functions on the stack usually serve as a reasonably good unique identifier for the crash.

    All or part of the crash signature can be copied directly into the known crashes file.  For example, to create a crash signature pattern for the prefix of the above example crash, create a pattern like this:

    PREFIX : file_util::writefile___`anonymous namespace'::savelatertask::run

    In this example, any crash with these functions at the top of the stack will be filtered out by this pattern.

    Also, please include a comment above the pattern indicating the issue number for the associated bug.
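
To make steps 1, 3 and 7 concrete, here is a minimal Python sketch of how a test output could be checked against a list of known crash patterns.  It is not the reliability system's actual implementation.  The "REGRESSION" and "INFO" strings and the "PREFIX : ..." line format are taken from the text above.  The "SUBSTRING" and "REGEX" keywords, the "#" comment syntax, and all function names are assumptions made for illustration; the documentation at the top of known_crashes.txt is the authority on the real format.

```python
import re

# Marker strings quoted in the steps above.
NEW_CRASH = "REGRESSION: NEW crash stack traces found"
CRASH_RATE = "REGRESSION: Significant increase on crash rate"
NEW_SIGNATURE = "INFO: NEW stack trace signature found:"


def failure_causes(output):
    """Step 1: report which failure markers appear in the test output."""
    causes = []
    if NEW_CRASH in output:
        causes.append("new_crash")
    if CRASH_RATE in output:
        causes.append("crash_rate")
    return causes


def new_signatures(output):
    """Collect the signature line that follows each NEW-signature marker."""
    lines = output.splitlines()
    return [lines[i + 1].strip()
            for i, line in enumerate(lines[:-1])
            if NEW_SIGNATURE in line]


def load_patterns(text):
    """Parse 'KIND : pattern' lines; blank and '#' lines are skipped.

    Assumption: comments start with '#', and KIND is one of PREFIX,
    SUBSTRING or REGEX. Only PREFIX appears in the text above.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, sep, body = line.partition(" : ")
        if sep:
            patterns.append((kind.strip().upper(), body.strip()))
    return patterns


def is_known(signature, patterns):
    """Step 3: return True if any pattern matches the crash signature."""
    for kind, body in patterns:
        if kind == "PREFIX" and signature.startswith(body):
            return True
        if kind == "SUBSTRING" and body in signature:
            return True
        if kind == "REGEX" and re.search(body, signature):
            return True
    return False
```

A signature returned by `new_signatures` that `is_known` rejects for every pattern is the kind of crash that would need either a generalized existing pattern or a new entry with its issue number in a comment.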

Won't the known crashes list get out of date when crashes get fixed?


Yes.  If a crash is fixed, it should be removed from the known crashes file.  Otherwise, the crash could be silently reintroduced.

Cleaning up the known crashes list is currently a manual process.  We're thinking about ways to automatically update this list to keep it from getting stale.
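
One possible aid for the manual cleanup, sketched here purely as an illustration and not as a description of any existing tool, is to list the prefix patterns that matched no crash signature in a set of recent runs.  The function name and both inputs are hypothetical.  A pattern with no recent matches is only a candidate for removal, because an intermittent crash may simply not have occurred in those runs.

```python
def unmatched_prefixes(prefixes, recent_signatures):
    """Return the prefix patterns that matched none of the given signatures.

    prefixes: iterable of prefix pattern strings (the text after 'PREFIX : ').
    recent_signatures: crash signatures collected from recent runs.
    """
    signatures = list(recent_signatures)
    return [prefix for prefix in prefixes
            if not any(sig.startswith(prefix) for sig in signatures)]
```

Each result still needs a human check against the associated bug before the entry is deleted.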

How can I get more information about a crash? (Google internal only)


Visit http://chromebot and find the associated run, which will be named after the repository revision of the run.  From here you can download crash dumps and look at log output.

How can I reproduce a crash?

For site-related crashes, reproducing the crash should be a matter of navigating to the URL that is reported in the log.  For UI test crashes, look for the "UI sequence: SetUp,newtab,closetab,downloads,..." line in the log file.  You can then run the automated_ui_tests binary with the flags --key=Setup,newtab,closetab,downloads,... --gtest_filter=AutomatedUITest.TheOneAndOnlyTest and only that sequence will be run.
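
Turning the log line into a command line can be scripted.  The sketch below only assembles the argument list; the binary name and both flags come from the paragraph above, while the function name `repro_command` is hypothetical.  The action sequence is copied verbatim from the log line, so pass in the complete line rather than an abbreviated one.

```python
def repro_command(log_line, binary="automated_ui_tests"):
    """Build the argument list that replays one UI sequence from the log."""
    _, marker, sequence = log_line.partition("UI sequence:")
    if not marker:
        raise ValueError("log line does not contain a UI sequence")
    # Drop surrounding whitespace and empty entries, keep the original order.
    actions = [action.strip() for action in sequence.split(",") if action.strip()]
    if not actions:
        raise ValueError("UI sequence is empty")
    return [
        binary,
        "--key=" + ",".join(actions),
        "--gtest_filter=AutomatedUITest.TheOneAndOnlyTest",
    ]
```

The returned list can be handed to `subprocess.run` from a directory that contains the binary, or joined with spaces for pasting into a shell.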

Questions?


Ask ace@chromium.org.