Memory sheriff

Memory sheriffs should watch the "Memory FYI" waterfall. Yes, I don't like this confusion too.

Everything for a chromium sheriff applies plus:

Join the fun

If you want to help maintain a memory overflow free tree, please contact thestig@.

What to watch

  • Memory bot console.
    Failures on the memory bots are just as important as normal failures, even though they can up to 1.5 hours to cycle.

  • Your @google.com inbox.
    When a unique failure happens, the memory sheriff and the blamelist committers should receive an e-mail.
    Please note the sender is throttled though so don't rely on e-mail only.
    If you believe the e-mail sender is completely broken, ping timurrrr@

  • IRC #chromium on freenode.
  • Be available on IM.

Tools on the memory waterfall

  • Memcheck a.k.a. Valgrind (Linux, Chrome OS, Mac) - finds memory errors like memory leaks, accesses to uninitialized or un-allocated memory etc.
  • ThreadSanitizer (Windows, Linux, Mac) - finds data races.
  • HeapChecker (Linux) - finds memory leaks. It is much faster than Valgrind but can't substitute it completely because of the limited functionality.

Sheriffing Tools

We have several tools designed to simplify the sheriffing duties.
First, waterfall.sh, try it like this:
sh tools/valgrind/waterfall.sh
Please read the chromium-dev thread about this script for the basic idea and some how-to's.

Next, there is scan-build.py. It allows you to scan through build logs looking for common terms. (Most often a error hash, so you can quickly see when an error first surfaced)

scan-build.py --update updates the local cache to the latest state
scan-build.py --find <string> looks through all build logs for a string

Finding travels backwards until it hasn't encountered the search term for a given number of builds. (CUTOFF, currently set to 100). If it travels further backwards than your current cache is filled, it will automatically fetch more. It will however _not_ fetch new logs.

Updating makes sure that we have at least CUTOFF builds locally available, and catches up to the latest build logs.

What to do with failures on the Memory FYI waterfall

There are two main types of failures you can observe on the memory bots: memory reports detected and test failures.
Both are actionable by either fixing the code (probably reverting a recent change) or suppressing/excluding the failures.

Recomendation: consider sending your patches to the next sheriff on the schedule. Memory errors are not fixed fast usually, so it's good to be up to date before you start your sheriffing shift.

When to close the tree or revert

Since the bots on the Memory FYI waterfall cycle slowly, it's hard to keep up with what's happening on these slaves so we don't close the tree automatically as other waterfalls do.
You may want to close the tree manually to throttle commits so you can commit your suppressions faster.
You can close the tree by typing "Tree is closed (Memory FYI waterfall is too red)" at http://chromium-status.appspot.com/

Please note that some of the reports indicate serious bugs (e.g. "unaddressable access", "use after free", etc. - they are likely to affect stability/security).
If you see a new serious report and it's clear which change caused it - go ahead and revert.
Also, the same rule applies to not-so-serious reports: if you see a recent commit with an obvious bug which
showed up on the Valgrind bots, talk to the commiter if he's OK with reverting and polishing his CL. This is something like an unsolicited code review, right? :) 

Suppressing memory reports

We suppress some of the memory reports, either because they are from system libraries we can't do anything about, or because we already have bugs filed in the Chromium issue tracker.
By suppressing errors instead of excluding tests we still get coverage for the tests with known memory reports.

NOTE: Suppressions may hide real bugs. Please don't suppress to much - consider reverting instead!
Check that the suppression for each bug is removed as soon as the bug is fixed (ideally in the same CL).
Also, please take time to prune unused suppressions.
You can check if a suppression is used using tools/valgrind/unused_suppressions.py or this dashboard.

The chrome_tests.sh/bat scripts read overall suppressions from several sources:
  • tools/valgrind/memcheck/suppressions.txt (for Valgrind Memcheck, used on all platforms)
  • tools/valgrind/memcheck/suppressions_mac.txt (additional Mac-only file for Memcheck)
  • tools/valgrind/tsan/suppressions.txt (for ThreadSanitizer, used on all platforms)
  • tools/valgrind/tsan/suppressions_win32.txt (ThreadSanitizer on Windows)
The general form is tools/valgrind/TOOL/suppressions[_PLATFORM].txt, where TOOL is one of: memcheck, tsan, drmemory; and PLATFORM is linux, mac, win32 or empty.
One exception is HeapChecker which has its own path tools/heapcheck/suppressions.txt

In general, any suppression that is there because of a bug in chromium should be named bug_NNNNNN where NNNNNN is the chromium bug number, and the changeset that adds that suppression should include the string BUG=NNNNN in its description.

The runner script automatically generates suppressions for all unique errors reported, like this:
22 bytes in 1 blocks are definitely lost in loss record 491 of 3,129 // this is a report
malloc (mp/scripts/valgrind-memcheck/coregrind/m_replacemalloc/vg_replace_malloc.c:241)
WTF::fastMalloc(unsigned int) (third_party/WebKit/JavaScriptCore/wtf/FastMalloc.cpp:249)
WebCore::StringImpl::createUninitialized(unsigned int, unsigned short*&) (third_party/WebKit/JavaScriptCore/wtf/text/StringImpl.cpp:96)
WebCore::StringImpl::create(unsigned short const*, unsigned int) (third_party/WebKit/JavaScriptCore/wtf/text/StringImpl.cpp:108)
WebCore::StringImpl::substring(unsigned int, unsigned int) (third_party/WebKit/JavaScriptCore/wtf/text/StringImpl.cpp:186)
WebCore::String::substring(unsigned int, unsigned int) const (third_party/WebKit/JavaScriptCore/wtf/text/WTFString.cpp:257)
WebCore::KURLGooglePrivate::componentString(url_parse::Component const&) const (third_party/WebKit/WebCore/platform/KURLGoogle.cpp:313)
[SNIP - some random stuff e.g. MessageLoop, DispatchToMethod etc]
The report came from the `AutomationProxyVisibleTest.WindowGetViewBounds` test.
Suppression (error hash=#0CAC77B0AD40A91D#):
{
<insert_a_suppression_name_here> // file a bug and replace it with bug_NNNNN before commiting
Memcheck:Leak
fun:malloc
fun:_ZN3WTF10fastMallocEj
fun:_ZN7WebCore10StringImpl19createUninitializedEjRPt
fun:_ZN7WebCore10StringImpl6createEPKtj
fun:_ZN7WebCore10StringImpl9substringEjj
fun:_ZNK7WebCore6String9substringEjj
fun:_ZNK7WebCore17KURLGooglePrivate15componentStringERKN9url_parse9ComponentE
[SNIP]
}

First, check there's no similar suppression in the corresponding suppression files.
It may just need some wildcarding.

If there's no such suppression, copy everything in between {...} and add it to the appropriate suppressions file, e.g.:
  • If a ThreadSanitizer failure is likely cross-platform, add the suppression to tools/valgrind/tsan/suppressions.txt
  • If a Valgrind failure is Mac-specific, add the suppression to tools/valgrind/memcheck/suppressions_mac.txt
  • etc
Make sure to file a bug (see recommendations below) and use the bug number as the name of the suppression.
Also, you may consider looking through the suppression stack to replace unrelated frames with "..." (matches any number of lines of stack) or "fun:*" (matches one line).
{
   bug_56789
   Memcheck:Leak
   fun:malloc
   fun:_ZN3WTF10fastMallocEj
   fun:_ZN7WebCore10StringImpl19createUninitializedEjRPt
   fun:_ZN7WebCore10StringImpl6createEPKtj
   fun:_ZN7WebCore10StringImpl9substringEjj
   fun:_ZNK7WebCore6String9substringEjj
   fun:_ZNK7WebCore17KURLGooglePrivate15componentStringERKN9url_parse9ComponentE
}

Submitting a patch

Now send the patch for review.
Review recommendations:
  • use "NOTRY=true" in the review description to avoid using CQ tests (they're useless for suppressions)
  • use "TBR=reviewer" to save time if you're comfortable with writing suppressions, otherwise use a suppression reviewer from your timezone (ping him to make a quick review!)
  • don't forget to mention BUG=NNNNNN in the changelist description
Now commit.

When the leak gets fixed, make sure to ask the person who fixes it to remove the suppression again -- ideally in the same CL that contains the fix.

Excluding tests

Some tests run slowly or poorly under Valgrind.
If they fail even without Valgrind, just add the DISABLED_ prefix to the test case name.

If tests are hanging or crashing only on Valgrind, disable them using the files in tools/valgrind/gtest_exclude/test_binary.gtest[_platform].txt,
where test_binary is (base_unittests, ui_tests, etc) and platform can be none (both Linux and Mac), linux or mac.
Again, HeapChecker is an exception - use tools/heapcheck/test_binary.gtest.txt instead.
Please file bug(s) for any tests you disable and point at the bug(s) where you exclude the test(s)!

For example, if ExampleTest.PeelOranges from unit_tests fails under Valgrind on both Linux and Mac, add the following to tools/valgrind/gtest_exclude/unit_tests.gtest.txt:
# Crashes when run under Valgrind.  http://crbug.com/4567
ExampleTest.PeelOranges

Filing good memory bugs

Whenever you add a suppression to one of the suppression files or exclude a test, you are required to also file a bug to track the error.

A good bug report should have the following:

  1. A link to the build cycle that the error first started appearing.
    Linking to an arbitrary result with the failure is not helpful; you should go back through the buildbot results to find when the error first started occurring and link to that cycle.
    See the next section for some tips on tracking this information down if it is not obvious from the buildbot logs.

  2. The output of the error, i.e the report, the test name and the generated suppression.
    Buildbot logs are only kept for a finite amount of time.
    You should always paste the symbolicated backtrace as well as the mangled suppression so that if the bug is left open for a while, the report is still useful.

  3. The revision corresponding to the stack traces in the output.

  4. Put the author of the CL that most likely caused the error in the Owner field.
    Use the failing build blamelist and 
    svn annotate (including the web version which is slow but still useful).
    Chromium codesearch can also be helpful.

    If there isn't an obvious author, you should CC the part of the blamelist of the build cycle and/or the past authors of the suspicious code that could be guilty.
    Sometimes the reports are flaky and show up only after a number of runs (especially Valgrind leak reports).
    In this case, please explicitly say that the report is flaky.

  5. This relates to the first point: it is very important to track down the first instance of the failure so this information is accurate.

  6. Apply appropriate labels to the bug: Stability-ValgrindStability-ThreadSanitizer, Stability-HeapChecker, Stability-DrMemory.
    Also indicate the platform on which the failure occured using the OS labels.

  7. When the bug is fixed - check that the suppression is removed!

The title of the bug report should indicate the type of error that Valgrind reported (e.g. "Memory leak in Foo.Bar since r123456")
The type of failure is reported as the second line of the suppression.
Some common ones are 
Memcheck:LeakMemcheck:UninitializedThreadSanitizer:Race, and Memcheck:Unaddressable.
If you see the latter one, you may consider reverting the guilty change (see related section above).

WebKit Valgrind

The WebKit Layout tests are run under Valgrind. Due to the vast number of these tests only a portion (1500 or so) are run on each iteration.

This implies a couple of things:
  • that flakes can be expected since the tests are different each run, and 
  • there is little or no correlation with the CL range at the time a particular test failed.
Ideally, both the WebKit and Memory sheriffs will log bugs for errors observed. When logging these bugs the best owner may be determined using svn/git blame on the WebKit tree.
If no owner is obvious then adding the "Area=WebKit label:Memory label:Valgrind" tags will help these get triaged appropriately.

Upcoming sheriffs (not authoritative)

This list is meant to be a quick reference, but are not updated automatically and are not authoritative. The authoritative lists are in the calendars. See how to swap if you can't make it.
  • 2013-04-18 to 2013-04-23: oshima
  • 2013-04-23 to 2013-04-26: glider
  • 2013-04-26 to 2013-04-30: timurrrr
  • 2013-05-01 to 2013-05-04: thakis
  • 2013-05-06 to 2013-05-09: groby
  • 2013-05-09 to 2013-05-14: rnk
  • 2013-05-14 to 2013-05-17: eugenis
  • 2013-05-17 to 2013-05-22: dhollowa
  • 2013-05-22 to 2013-05-25: zhaoqin
  • 2013-05-27 to 2013-05-30: timurrrr 
  • 2013-05-30 to 2013-06-04: bruening
  • 2013-06-04 to 2013-06-07: thestig
  • 2013-06-07 to 2013-06-12: oshima
  • 2013-06-12 to 2013-06-15: glider
  • 2013-06-15 to 2013-06-20: groby
  • 2013-06-20 to 2013-06-25: thakis
  • 2013-06-25 to 2013-06-28: eugenis
  • 2013-06-28 to 2013-07-03: dhollowa
  • 2013-07-03 to 2013-07-06: bruening
  • 2013-07-08 to 2013-07-11: zhaoqin
  • 2013-07-11 to 2013-07-16: timurrrr
  • 2013-07-16 to 2013-07-19: rnk
  • 2013-07-19 to 2013-07-24: glider
  • 2013-07-24 to 2013-07-27: oshima
  • 2013-07-29 to 2013-08-01: thestig
Comments