Note: As of 1/24/2012, the reliability bots have provided no real results for 6-12 months. Even green builds returned warning codes indicating that no data was accessible. This is a known issue while a replacement system is being built. Please ignore any purple/offline bots for now.
Understanding reliability test results
The Chromium Buildbot runs a large-scale set of tests called the
distributed reliability tests. This system automatically tests new
builds against thousands of pages from the web. It also fuzz-tests
the user interface against thousands of possible user action
sequences. The goal of this testing is to catch crashes early, in
particular crashes that users are likely to see.
The results for each run are available on the Buildbot; see the Win Reliability
builder. Because each run takes 30-45 minutes, several changes are
often batched together in the same run. To look at the list of changes
that went into a particular run, click on the "stdio" link of a
reliability test run and find something like the following near the
top:
Will stage build r15671 and get the complete result of previous build r15668.
or
Will get the partial result of current build r15668.
In either case, the message tells you the svn revision of the build against which the test was run.
Similarly, find the previous successful reliability run and note its revision. The two revisions bound the range of changes that turned the test red. One way to quickly examine the changes that went in between two revisions is to use the following URL template: http://build.chromium.org/buildbot/perf/dashboard/ui/changelog.html?url=/trunk/src&mode=html&range=SUCCESS_REV:FAILURE_REV
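As a rough illustration of the lookup described above, the following Python sketch pulls the revision out of a "stdio" message and fills in the changelog URL template. The function names are hypothetical, and the sketch assumes that the revision whose result is reported (r15668 in both sample messages) is the one of interest.

```python
import re

# URL template quoted in the text; SUCCESS_REV and FAILURE_REV become format fields.
CHANGELOG_URL = (
    "http://build.chromium.org/buildbot/perf/dashboard/ui/changelog.html"
    "?url=/trunk/src&mode=html&range={success}:{failure}"
)

# Matches either of the two message forms shown in the text.
_REPORTED_REV = re.compile(
    r"complete result of previous build r(\d+)"
    r"|partial result of current build r(\d+)"
)


def reported_revision(stdio_text):
    """Return the revision whose result the run reports, or None if absent."""
    match = _REPORTED_REV.search(stdio_text)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on the message form.
    return int(match.group(1) or match.group(2))


def changelog_url(success_rev, failure_rev):
    """Fill the changelog URL template with a revision range."""
    return CHANGELOG_URL.format(success=success_rev, failure=failure_rev)
```

For example, passing the revision of the last green run and the revision of the red run to `changelog_url` yields a link ending in `range=SUCCESS_REV:FAILURE_REV` with both placeholders filled in.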
The reliability tests use a set of criteria to determine whether or not
the build should fail (i.e. turn red). "New" crashes, that is crashes
that are not already marked as known, will trigger a failure.
Additionally, if the crash rate is simply too high this will also cause
the build to fail.
When a new crash occurs during a test, its crash dump is analyzed and a
stack trace is produced for the functions that were on the stack at the
point of the crash. This stack trace is compared against a list of
known crashes. If none of the crashes in the known crashes list match
the crash, the crash is considered new and the build will fail.
This system is not perfect. Crash stacks for the same crash may change
or have minor variants, causing them not to match any known crash and
be incorrectly reported as new. In other words, false positives are
possible. As described below, properly constructing a crash signature
pattern of the crash is the key to reducing the number of false
positives. Over time, the system has proven to work well despite
occasional false positives.
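To make the matching idea concrete, here is a minimal Python sketch of how a prefix pattern can absorb minor variants of the same crash. It is not the test harness's actual code; the function name and the variant signature are hypothetical, and only prefix matching is modeled.

```python
def is_new_crash(signature, known_prefixes):
    """Return True if no known prefix pattern matches the crash signature."""
    return not any(signature.startswith(prefix) for prefix in known_prefixes)


# A prefix built from the top two functions of a crash stack.
known_prefixes = [
    "file_util::writefile___`anonymous namespace'::savelatertask::run",
]

# Two signatures that differ only deeper in the stack (the second is made up).
original = (
    "file_util::writefile___`anonymous namespace'::savelatertask::run"
    "___messageloop::runtask___messageloop::dowork"
)
variant = (
    "file_util::writefile___`anonymous namespace'::savelatertask::run"
    "___messageloop::runtask___messageloop::run"
)

for sig in (original, variant):
    # Both are treated as known because they share the same prefix.
    print(is_new_crash(sig, known_prefixes))
```

A pattern that listed the full stack would match only `original` and report `variant` as new, which is the kind of false positive a more general pattern avoids.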
Okay, so what do I do if the build turns red?
Most importantly, don't ignore it. Please fix it ASAP following
the instructions in this section. Contact ace@chromium.org with any questions.
A red build may be the result of several factors. It could mean
that a new crash has been introduced, it could mean that there are too
many crashes for the product to be considered stable, or it could be a
false positive. To help you identify what needs to be done in each
case, use the following flow chart. More details on each step are
explained below.
The numbered steps below correspond to the steps in the above flow chart.
- First, identify what the cause of the build failure was.
This will be listed in the reliability test output, which you can find
by clicking on the "stdio" link of the reliability test run. If the
build failed because of an unknown crash, the output will say
"REGRESSION: NEW crash stack traces found". If the build failed
because there were too many crashes, the output will say "REGRESSION:
Significant increase on crash rate".
- If the failure was due to a significant crash rate increase, it's a
good idea to close the tree and revert the changes that likely caused
the issue. Follow the link near the top of the test output to see the
list of changes in this run (see previous section for more info).
- If the failure was due to a new crash, first check whether or not the
crash is a variant of one of the current known crashes. The known
crashes list is located at:
src/chrome/test/data/reliability/known_crashes.txt
Open this file in your favorite text editor. This file contains crash
signature patterns for known crashes. Some definitions before proceeding:
- Crash stack: the call stack of all functions when the crash happened.
- Crash signature: a concatenation of all functions in chrome.dll on the crash stack.
- Crash signature pattern:
a pattern intended to match a specific crash signature or a certain
class of crash signatures. A pattern is intended to be a generalized,
yet still identifying, form of a particular crash. It can be a prefix
of a crash signature, a substring, or something more complex such as a
regular expression.
For more information, see the documentation at the top of known_crashes.txt.
The crash signature for the new crash will be listed on the reliability test output. A new crash will look like the following:
INFO: NEW stack trace signature found: file_util::writefile___`anonymous namespace'::savelatertask::run___messageloop::runtask___messageloop::dowork...
Check if there are any crash signature patterns in the known crashes
list that look very similar to the signature of the new crash. If
there are, the current signature pattern in the known crashes list may
not be general enough. You may want to make it more general (for
example, by removing some functions from it) or you may need to add
another pattern to catch this variant of the crash.
- If the new crash does not appear to be in the known crashes list, see if there is already a bug on file in the Chromium issue tracker.
Try searching for something like the top function on the stack; this
usually works fairly well. Try searching all issues, instead of just
open ones, to see if a bug was already filed and closed (for example as
WontFix). Sometimes bugs will be closed due to a lack of
information/reproducibility.
- Take a look at the
severity of the crash, the number of occurrences, and the changes that
may have caused the crash. Use your best judgment to determine whether
or not reverting changes is necessary.
For example, we typically don't revert WebKit merges on account of new
crashes. Instead, we usually file bugs and wait for the fixes to go in
upstream.
- If you decide not to revert any changes, please file a bug. In the
bug, mention the revision range in which the crash started occurring
and, if possible, the most likely change(s) in that range that may have
caused the crash. Also include the entire crash stack, as shown on the
reliability test output, in the bug report. Please CC
huanr@chromium.org on the bug.
- Make sure the known crashes list is updated appropriately.
When adding or updating the known crashes list, try to pick a good
signature pattern for the crash. Usually a prefix pattern with the
first 3-5 functions works well. That is, the top several functions on
the stack usually serve as a reasonably good unique identifier for the
crash.
All or part of the crash signature can be copied directly into the
known crashes file. For example, to create a crash signature pattern
for the prefix of the above example crash, create a pattern like this:
PREFIX : file_util::writefile___`anonymous namespace'::savelatertask::run
In this example, any crash with these functions at the top of the stack will be filtered out by this pattern.
Also, please include a comment above the pattern indicating the issue number for the associated bug.
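The triage steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the reliability tooling: the function names are hypothetical, and the "___" separator between function names is inferred from the example signature rather than documented here.

```python
# Marker strings quoted in the steps above.
NEW_CRASH_MARKER = "REGRESSION: NEW crash stack traces found"
CRASH_RATE_MARKER = "REGRESSION: Significant increase on crash rate"
SIGNATURE_MARKER = "INFO: NEW stack trace signature found:"

# Assumption: functions in a signature are joined by three underscores.
SEPARATOR = "___"


def failure_causes(stdio_text):
    """Return which of the two documented failure causes appear in the output."""
    causes = []
    if NEW_CRASH_MARKER in stdio_text:
        causes.append("new_crash")
    if CRASH_RATE_MARKER in stdio_text:
        causes.append("crash_rate")
    return causes


def new_signatures(stdio_text):
    """Collect the signatures reported on 'NEW stack trace signature' lines."""
    signatures = []
    for line in stdio_text.splitlines():
        _, found, rest = line.partition(SIGNATURE_MARKER)
        if found:
            signatures.append(rest.strip())
    return signatures


def suggest_prefix_pattern(signature, depth=3):
    """Build a PREFIX pattern line from the top `depth` functions of a signature."""
    functions = signature.split(SEPARATOR)
    return "PREFIX : " + SEPARATOR.join(functions[:depth])
```

Calling `suggest_prefix_pattern` with `depth=2` on the example signature reproduces the PREFIX line shown above. The text recommends the first 3-5 functions in most cases, so the default depth of 3 is only a starting point to be reviewed by hand.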
Won't the known crashes list get out of date when crashes get fixed?
Yes. If a crash is fixed, it should be removed from the known crashes
file. Otherwise, the crash could be silently reintroduced.
Cleaning up the known crashes list is currently a manual process.
We're thinking about ways to automatically update this list to keep it
from getting stale.
How can I get more information about a crash? (Google internal only)
Visit http://chromebot and find the associated run, which will be
named after the repository revision of the run. From here you can
download crash dumps and look at log output.
How can I reproduce a crash?
For site-related crashes, it should be a matter of navigating to the URL that is reported in the log. For UI test crashes, look for the "UI sequence: SetUp,newtab,closetab,downloads,..." line in the log file. You can then run the automated_ui_tests binary with the flags --key=Setup,newtab,closetab,downloads,... --gtest_filter=AutomatedUITest.TheOneAndOnlyTest and only that sequence will be run.
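As a small convenience, the following Python sketch turns a "UI sequence:" log line into the argument list for automated_ui_tests. The function name is hypothetical. The sketch passes the sequence through unchanged, which is an assumption, since the text shows "SetUp" in the log and "Setup" in the flag.

```python
UI_SEQUENCE_MARKER = "UI sequence:"


def ui_repro_args(log_line):
    """Return the automated_ui_tests argument list for a log line, or None."""
    _, found, rest = log_line.partition(UI_SEQUENCE_MARKER)
    if not found:
        return None
    sequence = rest.strip()
    return [
        "automated_ui_tests",
        "--key=" + sequence,
        # Filter quoted in the text so that only this one sequence runs.
        "--gtest_filter=AutomatedUITest.TheOneAndOnlyTest",
    ]
```

The returned list can be passed to `subprocess.run` from the directory that contains the test binary.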
Questions?
Ask ace@chromium.org.