Sheriff Log: Chromium OS (go/croslog)

Please update go/cros-sheriff-playbook when you find a build/infra failure and can map it to the action the sheriff should take.

Sheriffs: chadversary, oliverwen, mnissler(non-PST):
  • 714571: HWtest invocation times out
  • 714598: Image signing step fails on release builders
  • 714601: logging_UserCrash fails on x86-generic-incremental
  • 714608: vmtest timeout on ASAN bots

Sheriffs: tbroch, zhihongyu, owenlin(non-PST):

  • 697274: daisy-skate-chrome-pfq runs hwtest but the CQ doesn't. CL to remove hwtest from pfq
  • CQ:14360: arc-camera breakage -> reverts 14364
  • 713531: security_SandboxStatus fails.  Remove from bvt-inline for now
  • 713004: Tests passed but got aborted by AutotestAbort
    • 713856: network load likely suspect. 10x increase in file size since 4/12
    • 4/20: unpinned chrome; not clear it can be blamed
    • 4/19: pin chrome to 59.0.3064.0_rc-r1 as workaround
  • 713226: boost, python-gflags, and other packages changed on the mirror, mismatching the manifest.
  • 712679: canary builders failing : long build time for chromeos-chrome
  • 712297: chrome PFQ failures, goma enablement side-effect for TestSimpleChromeWorkflow
  • 712102: veyron_minnie-chrome-pfq bots look full-disk.
  • 712109: cyan-chrome-pfq is failing due to libstdc++ version mismatch.
  • 685889: (dup) veyron_mighty_paladin, winky_paladin failed due to (IntegrityError: "Duplicate entry for key 'buildbucket_id_index'")
  • 712505: Fizz Paladin: cbuildbot failed at the AndroidMetadata step
  • 689105: /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
  • 697967: ASAN failures : no space left during build_images
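The recurring 689105 entry ("/usr/bin/python: bad interpreter") is the kernel's complaint that the interpreter path in a script's `#!` line doesn't exist on the DUT. A minimal triage sketch, assuming nothing about autotest's internals (the helper name is ours):

```python
import os

def check_shebang(script_path):
    """Parse a script's shebang line and report whether the
    interpreter it names actually exists on this machine.

    'bad interpreter: No such file or directory' is the kernel
    failing exactly this check when exec-ing the script.
    """
    with open(script_path, "rb") as f:
        first = f.readline().decode("utf-8", "replace").strip()
    if not first.startswith("#!"):
        return None, False  # no shebang at all
    parts = first[2:].strip().split()
    if not parts:
        return None, False  # empty shebang
    interpreter = parts[0]
    return interpreter, os.path.exists(interpreter)
```

Running this against the failing autotest control scripts on an affected DUT would show immediately whether /usr/bin/python is missing or merely mislinked.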

    Sheriffs: shchen, philipchen, itspeter(non-PST):
    • 689105: /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
    • 708679: Some shards stop taking RPC calls
    • 708715: Frequent pre-cq failures on caroline.
    • 708429: 4/5: suspect pre-cq-launcher has a permission issue; closing the tree
    • Suspect CL:465488 breaks pre-cq-launcher #8925, revert and restart pre-cq.
    • 707696: itspeter@ believes master-paladin build #14172 is marked incorrectly; it should be green based on the issue.
    • 707629: master-paladin failed continuously; suspect a slave is missing a python package, but unable to investigate further. Looks flaky, as it passed on guado_moblab-paladin #5520
    Sheriffs: pberny, norvez, wnhuang (non-PST)
    • 689105 /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
    • 703914 platform_MemCheck is flaky => flaky test
    • 699353 desktopui_ScreenLocker FAIL: Unhandled DevToolsClientUrlError => flaky test (Chrome crash)
    • 703789 graphics_Gbm: DUT Rebooted unexpectedly nyan_kitty. => flaky test
    • 690307 swap shard workload => Fixed
    • 704669  (resolved) Reef derivative canaries have broken linux firmware
    • 704381 "Report" build step doesn't time out refreshing access tokens for gsutil
    • 704194 (Two) Asuka devices not coming up after reboot during AUTest
    • 705247 android image signing failing due to out of space
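Several entries above (705247 here, 697967 in the previous rotation) are plain out-of-space failures discovered mid-build. A sketch of a fail-fast preflight check (names are illustrative, not cbuildbot's):

```python
import shutil

def has_headroom(path, need_gib):
    """Return True if `path`'s filesystem has at least `need_gib`
    GiB free -- checked before a signing/paygen step so the build
    fails fast with a clear message instead of dying mid-archive.
    """
    free_gib = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gib >= need_gib
```

Calling this at stage entry turns a cryptic mid-archive ENOSPC into a single clear precondition failure.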

    Sheriffs: smbarber, hoegsberg, shunhsingou (non-PST)
    • 701400 Repair flow no longer working for guado_moblab
    • 701693 SSH connection fails for veyron_speedy-paladin/veyron_mighty-paladin
    • 689105 /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest

    Sheriffs: leecy, scollyer (PST), littlecvr (non-PST)
    • 699353 desktopui_ScreenLocker FAIL: Unhandled DevToolsClientUrlError
    • 698825 caroline gets canceled because the build takes too long to finish
    • 700021 gsutil issue in container setup causes "missing lockfile" failure
    • 695287 Slowness and 502 errors from cautotest AFE because of cautotest mysql slowness
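For 695287, transient 502s from a slow AFE are better absorbed by the caller than surfaced as build failures. A hedged sketch of retry-with-backoff; `request_fn` and the exception type are stand-ins, not the real AFE client API:

```python
import time

def call_with_backoff(request_fn, attempts=4, delay=1.0):
    """Retry a flaky RPC with exponential backoff.

    Each failed attempt waits delay * 2**attempt seconds before
    retrying; the final failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return request_fn()
        except RuntimeError:  # stand-in for an HTTP 502 from the AFE
            if attempt == attempts - 1:
                raise  # out of retries; let the caller see the error
            time.sleep(delay * (2 ** attempt))
```

This doesn't fix the underlying mysql slowness, but it keeps one slow query from failing a whole build.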

    Sheriffs: moch, marcheu (PST)
    • 696606 devserver load may contribute to some provision failures
    • 696696 desktopui_MashLogin | FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
    • 698096 some canaries are running out of time
    • 694081 ARC availability check
    • 693610 tko_parser error
    • 694642 missing autoserv logs
    • 690822 CTS scheduling
    • 694755 chromeos.branch dying
    • 695172 cyan-chrome-pfq stuck
    • 695733 chrome re-pin
    • 695641 pre-cq-launcher failures due to oauth token invalidation
    • 695529 excessive provisioning errors
    • 696039 several jetstream flakes
    • 695940 kevin FW re-update
    • 639301 cyan stuck on shutdown
    Sheriffs: ejcaruso, mqg (PST), adurbin (non-PST)
    Infra: shuqianz
    • Generally, swarming issues and network problems have been a huge source of trouble this week.
    • reef, snappy, and pyro release builders were all marked important on 2/14
    • 693734 guado_moblab: AndroidMetadata failure; no ebuilds to satisfy "x11-base/xorg-server"
    • 693691 falco-release: suite timeouts (maybe network related? logs are bad, this is also happening on other boards)
    • 693597 nyan_kitty: CQ test failure
    • 693331 nyan_kitty: all CQ DUTs failed to provision
    • 693318 peppy: generic_RebootTest failure
    • 693313 breakpad compile failure from -Werror,-Winconsistent-missing-override
    • 693310 guado_moblab: broken CL made it past the CQ somehow
    • 693101 lab DHCP server configuration update took out the whole lab
    • 692342 kevin: provision failure loops (possible eMMC failures?)
    • 692236 falco_li: not enough DUTs to test canary
    • 692232 peppy: failed to provision
    • 692214 caroline: canary paygen issues
    • 692206 clapper: VMTest broken
    • 692129 snappy: no good repair build (unstable ToT)
    • 691729 kevin: unable to reach devserver
    • 690616 caroline: failed to perform stateful update (continued from last week)
    • 690286 reef: cs50-updater causing reboots and rollbacks (continued from last week)
    • 690232 candy: dbus issues causing canary failures (continued from last week)
    • 692240 setzer was moved between servers, resulting in some planned throttling
    Gardener: jennyz
    • 692247 falco-chrome-pfq, daisy_skate-chrome-pfq: failed to connect to DUT after AU
    • 687248 falco-chrome-pfq: flakiness in provisioning prevents chrome uprevs (continued from last week)

    Sheriffs: jinjingl, waihong
    • 691009 daisy_skate CQ: Devserver call failed: "" => Restarted devserver.
    • 690616 Caroline canary: Failed to perform stateful update
    • 690232 Candy: The name org.chromium.UpdateEngine was not provided by any .service files
    • 690286 no green build for reef family
    • 689794 samus-android-pfq failing HWTest - CrOS auto-update failed
    • 689694 CQ Failing Gerrit Unittests - gaierror => Fixed the test
    • 689105 multiple autoupdate_EndToEndTest failures at about 6:40 => Reverted CL
    • 689072 build_image failing again in canary archive step with cryptic error => Reverted CLs
    Gardeners: michaelpg, glevin
    • frequent falco-chrome-pfq failures. Suspect DUT replaced, but issue ongoing.
    • VideoPlayerBrowserTest.OpenSingleVideoOnDrive still flaky.
    • (fix in review): LKGM builder fails 50% of nights. Uploaded CL, ensured a run succeeded, and updated YAQS.
    • (resolved): depot_tools CL breaks SyncChrome step on canaries and PFQ; quickly reverted by dpranke@
    • (resolved): linker failure on amd64-generic-tot-asan-informational
    • (resolved): piex_loader.js is noisy in chromium browser_tests
    • various (resolved): flaky tests on Linux ChromiumOS. CLs reverted.
    • ketakid: PFQ failure for samus on 57 branch (fix here)
    Sheriffs: uekawa
     - stateful.tgz missing from caroline dev release builds. -- manually fixed
     - devserver down due to disk full, cleanup script wasn't running due to manifest. -- resolved and pushed.
     - dhcp outage caused lots of ssh connection timeout. -- should be resolved.
     - lakitu-paladin failing with GS upload failure. -- ACL was fixed.
     - lakitu-gpu-incremental has never succeeded -- a change went in.
     - falco-chrome-pfq failure. -- tried locking 
     - signers timing out
     - libinstallattributes failing with asan build. now fails with another failure.
     - all builders now seem to be failing to uprev, what!?

    1/23 - 1/25
    Gardeners: jamescook, warx
    • 683977 git lockfiles breaking chromeos amd64-generic Trusty builder. Resolved.
    • 683640 FAIL: Test did not complete due to Chrome or ARC crash. Java version issue. Disabled.
    • 684044 "All devservers are currently down" - incorrectly blaming *all devservers* when a single devserver call flakes. WontFix.
    • 683304 falco-chrome-pfq failures. Infra / test problem. Fix in flight.
    • 674209 constant video_ChromeHWDecodeUsed failures in tricky/peach_pit informational pfq. Reverted.
    • 685313 linker failure on chromeos asan in libbrillo, "may overflow at runtime; recompile with -fPIC". Toolchain? Still failing.
    • 685340 chromeos Chrome LKGM builder failing in cros_best_revision, git cl land failure. Flaky. Infra?
    • 685424 scheduler: Aborting large number of bvt-prebuild request from past canary causes slowdown, CQ failure. Ongoing.
    • 683828 Chrome compile failure, openh264 cpu architecture. Reverted.
    • 685675 Manually uprev Chrome to 58.0.2993.0 for Chrome OS
    • BuildPackages failure due to camera HALv2 autotest (not chrome), Reverted.
    • 685269 [VMTest] fails on cyan-tot-chrome-pfq-informational. Chrome / ARC incompatible. Fixed on ARC side.
    • 686193 amd64-generic-telemetry failure in vmtest telemetry_UnitTests SimpleTestVerify, PlayActionTest.testPlayWaitForPlayTimeout and webservd crash. Flaky.
    • 686265 Frequent exceptions (timeout) on Linux ChromiumOS Tests (dbg). Flaky.
    • 686266 Chrome OS PFQ annotator marks passing PFQ runs as failed if chrome didn't need to update. Tool issue.
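686266 describes the annotator counting a skipped Chrome uprev as a failure. A hypothetical sketch of the intended decision logic; stage names and status strings here are illustrative, not the real annotator's schema:

```python
def annotate_pfq_run(stage_results):
    """Reduce per-stage results to an overall PFQ verdict.

    The key point of the fix: a stage reporting 'skipped'
    (e.g. Chrome was already current, so no update ran) must
    count as success, not failure.
    """
    for status in stage_results.values():
        if status == "failed":
            return "failed"
        # "passed" and "skipped" both count as success
    return "passed"
```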
    Sheriffs: abhishekbh, adlr, kcwu
    Infra: dshi

    Resolved issues:
    • outage for builders ('module' object has no attribute 'RetriableHttp')
    Sheriffs: snanda, rspangler
    Infra: dgarrett

    • canary paygen failures with "no JSON object could be decoded"
    • build315-m2, build318-m2 can't sync (makes trybots somewhat unreliable)
    • [cyan-chrome-pfq] [veyron_minnie-chrome-pfq] failed HWTest [arc-bvt-cq] (swarming timeouts?)
    Not resolved, but not on fire either:
    • should not cause autotest to complain that it lost connectivity to the DUT
    • one unresponsive DUT caused CQ to fail (dup of
    • several builders failed HWTest with "Android did not boot!" errors (may have 2 different root causes)
    • failed with "Couldn't resolve host ''"
    • step failure should report root error (lots of GS failures on canaries Wed night)
    • No space left on device for kefka-release during paygenbuild
    • long delay between tests complete & stage end during bvt-inline stage
    • what was chromeos4-row4-rack12-host15 doing between 14:06 and 14:30?
    Resolved Issues:
    • reef-paladin is failing with out-of-space error in rootfs (temp workaround in place; testing longer-term fix)
    • unittest stuck for 25+ minutes (temporarily marked auron-paladin not important)
    •[bvt-cq] desktopui_MashLogin Failure on x86-alex-release/R57-9163.0.0 (root cause likely and; just disable that test on old platforms)
    • whirlwind paladins are failing: lack of healthy DUTs? (restarted scheduler)
    • paladin is failing HWTest; DUTs in repair failed state (fixed bad switch in lab)
    •'class ash::ShelfWidget' has no member named 'SetShelfVisibility' (broke Chrome PFQ)
    • shard (chromeos-server42.cbf) is down
    • canaries keep failing at archive step

    Sheriffs: johnylin

    Resolved Issues:

    Sheriffs: dtor, martinroth, yhanada
    Infra: kevcheng

    Resolved Issues:

    Sheriffs: dianders, itspeter
    Infra: sbasi

    Resolved Issues:

    Sheriffs: sonnyrao, benchan, mtomasz
    Gardeners: stevenjb

        Ongoing Issues:
    • fail gsutil uploads with AccessDeniedException 403

    • build 2448 failed due to provisioning error

    • failure in

    • tests failing in canaries with no individual test logs

    • rejecting manifest pushes ("failed to lock" error)

        Resolved Issues:

    Sheriffs: jinsong, puthik, hungte

        Ongoing Issues:
    • None

        Resolved Issues:


    Sheriffs: mcchou, mruthven, ravisadineni

    Internal Waterfall:
       Ongoing Issues:

    • Lumpy provision failed due to Unhandled DevServerException: CrOS auto-update failed for host chromeos6-row2-rack7-host12
      (b/33185795 is filed for tracking the offline status of this bot.)

    • Paygen issue on arkham-release builder seems to be a recurrence

    • lakitu-release builder GCEtest failed

    • guado_moblab-paladin failed at HWTest stage with moblab_RunSuite: FAIL: Unhandled AutoservRunError: command execution error

    • arkham-release builder failed at Paygen stage with cannot find source stateful.tgz error

    • veyron_speedy, wizpig failed at AUTest stage with image installation failure

    • and broke master-paladin

    • oak-release builder failed at HWTest stage with "(2006, 'MySQL server has gone away')" error

    • falco_li-release failed due to lack of DUT

    • sentry-release: inconsistent propagation for the same test failures.

    • provision failure, Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack1-host21

    • falco-chrome-pfq failed due to network issue
      b/33249596 P0 filed for syslab to troubleshoot

    • build_packages error due to authpolicy on x86-generic

       Resolved Issues:

    Public Waterfall:

       Ongoing Issues:

    Sheriffs: drinkcat, groeck, furquan

        Ongoing Issues:

    Please follow up on these, at least:
    • Lots of -paladin builders failures during ImageTest ( contains unsatisfied symbols). Had to pin Chrome.
    • VMTest in GCE instances?!
    • squawks pool:bvt unbalanced (please check what's going on?)
    • cros-beefy23-c2 out of disk space
    • guado_moblab: bad DUT
    • - Inadequate DUTs for falco_li
      • May not be fixed in the immediate future (we are short on HW)
    • - wizpig/terra-release builders fail during HWTest: An operational error occurred during a database operation: (2006, 'MySQL server has gone away')
    Less critical:
        Issues from last week:
    • - invalid oauth credentials. Some slaves were unable to retrieve images from google storage resulting in AUTest failures on the Canary waterfall.
    • - ssp picks random devserver.  Patches in place to mitigate.
        Resolved Issues:
    • terra-release. Bad DUT
    • - kevin-tpm2 keeps failing (jwerner has a fix)
    • - wizpig-release HWTest has been failing continuously for a few days.
      • Bad DUT
    • - glados-release SignerTest failure (should be fixed)
    • - pool: bvt, board: x86-mario in critical state (should be fixed)
    • - x86-{mario/alex}-{paladin/release/chrome-pfq} failure (also seems to affect other x86 3.8 boards like peppy/falco/lumpy/etc)
    • - veyron_minnie-android-pfq not running (builder offline)
    • - sentry-release experiencing test timeouts (probably duplicate of 666070)
    • - cros-beefy70-c2: Disk almost full, glimmer-cheets-release Paygen failures
        PFQ (gardening) issues:
    • None?
    Sheriffs: skau, ntang, pgeorgi
    Gardeners: jennyz

        Ongoing Issues:
    • - invalid oauth credentials. Some slaves were unable to retrieve images from google storage resulting in AUTest failures on the Canary waterfall.
    • - Bad DUT for guado_moblab-paladin
    • - Inadequate DUTs for falco_li
    • - sentry-release experiencing test timeouts
    • - ssp picks random devserver.  Patches in place to mitigate.

        Resolved Issues:
    • - x86-alex-paladin reports DUT unplugged. Actually, bad firmware CL in CQ.
    • - Not enough DUTs for buddy-release
    • - Lab restarted overnight. Caused 2 wedged slaves.
    • - Perceived lab slowness. Shard schedulers required restart.
    • - oak-paladin and reef-paladin failed due to bad restart of slaves
    • - peppy-release running client jobs as server jobs due to a bad image from devserver.
    • - No cyan boards for hw_video_acc_enc_vp8.  Misread debug message as error message.  Failure is expected.
    • - Multiple canaries failing due to overnight Ganeti restart.
    • - daisy_skate-paladins failing provision_AutoUpdate.double
        PFQ (gardening) issues:
    • None?

    10/31- 11/06
    Sheriffs: tfiga, dlaurie, yueherngl, semenzato (honorary)
    Gardeners: jamescook

        Ongoing Issues:

      • StageControlFileFailure due to DownloaderException
      • Canary runs fail with "DevServerException: stage_artifacts timed out"
      • related: Chrome LKGM is stale due to parrot-release failures
      • drone cannot connect to cloudSQL
      • login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational -> address space exhaustion on 32-bit Intel ASAN
      • b/32653128 - veyron_speedy-paladin constantly failing on an ARC++ related HWTest

          Resolved Issues:

        10/24- 10/31
        Sheriffs: kirtika, mka, deanliao, semenzato (honorary)
            Ongoing Issues on canaries
        • SetupBoard failure, last ~10 parrot canaries failed. 
        • Provision failure with error "Devserver portfile does not exist".
        • AUTest fails with kOmahaErrorInHTTPResponse (37)
        • No output from BackgroundTask for 8640 seconds
        • To look into: guado paladin caused consecutive master paladin failures on Friday
            PFQ (gardening) issues
        • New issues:
        • - Last AU on this DUT failed, "The python interpreter is broken", completed successfully (happened once)
        • - HWTest security_SandboxStatus failed twice on elm and veyron_mighty paladins.

        • Ongoing Issues:
        • - MobLab Failures in the CQ: dhcpd is not running. Crashing on shill restart (single occurrence)

        • Resolved issues:
        • b/32420834 - Slow UI with 500 Internal Server Error on a CL with many comments (pre-cq-launcher failed to fetch the CL)

        10/17- 10/23
        Sheriffs: cychiang, briannorris, semenzato (honorary)
        Gardeners: dshi, jrbarnette
                Ongoing Issues on canaries:
        • autoupdate_EndToEndTest, many different failures
        • autoupdate_Rollback
        • provision_Autoupdate.double
        • other provisioning failures (rsync errors, timeouts, error 37)
        PFQ (gardening) issues:
        • New Issues:
        • - lakitu cloud_SystemServices flakiness
        • - autotest-web-tests build errors are too opaque
          • Filed, noted a potential fix
        • - Not enough falco_li DUTs in the lab.
        • - kunimitsu-release: build_packages failed on autotest-deps-ltp with undefined ltp_syscall; happened once.
        • - guado_moblab-paladin: moblab_RunSuite: FAIL: Unhandled AttributeError: '_CrosVersionMap' object has no attribute 'get_stable_version'
        • - celes-release, gandof-release: signing failed due to gsutil/ssl timeout
        • - pre-cq failed because nyan_freon is removed
        • - x86-mario-release: security_ModuleLocking timed out
        • - Falco device chromeos2-row4-rack5-host7 is flaky in provision
        • - multiple paladins: security_ptraceRestrictions: DUT rebooted during the test run.
          • Caused by bad CLs that made it through for
          • Poor Kernel 3.10 HW coverage:
          • Bad CL in 3.10 has been reverted, but still flushing out of some canaries (2016-10-20)
        • - Nearly all canaries failed: paygen and AUTest fail to install the device image.
        • - chell signing/paygen failing due to new kernel cmdline flag
        • - jetstream_LocalApi failure
        • - wolf + veyron_speedy DUT availability
        • - kunimitsu build failures
          • Still not resolved; there's no paladin?
        • Resolved Issues:
        • - Chrome PFQ manifest errors
          • Waiting for next PFQ runs to come through
        • build_packages fail on almost all release builders, some paladin builders.
        • - security_SandboxedServices failure "One or more processes failed sandboxing"
        • - canary build failure because of minijail tree change. uprev of ebuild chumped. Fix to security_SandboxedServices chumped.
        • - autotest-web-tests issues on guado_moblab-paladin (experimental)
        • root caused to libcups/icedtea-bin - fix is in flight
        • cave-release: Fail to resolve host name for cros-beefy19-c2
        • b/32292437 - DUTs in pool crosperf are all 'repair failed'
        • Need to push change to autotest shard.

        10/10 - 10/16
        Sheriffs: chirantan, julanhsu, kinaba
        Gardeners: lpique, dbehr

        PFQ (gardening) issues:
        •  New Issues:
        • - guado_moblab: Repair failing. Happened once, didn't reoccur
        • - falco-chrome-pfq failing since build 4821 with apparent network issues after updating. Filed after digging into one of the failures on falco, and noticing that in one case the infra didn't reconnect to the DUT after it was provisioned. Possibly related to where falco becomes unpingable during provisioning.
        • - select_to_speak exists build error. Occurred once.
        • - Microcode SW error detected. Occurred once.
        • - [bvt-inline] security_SandboxedServices failure on lumpy-chrome-pfq (flake). "awk cannot open /proc/xxx/status" because the process ended between when the filename was generated and when awk tried to open it.
        •  Ongoing Issues:
        • [falco-chrome-pfq] almost always red
          • - provision failure "Device XXX is not pingable". This has plagued the falco-chrome-pfq builder, and is one of the main reasons we didn't automatically uprev Chrome this week.
        • [x86-generic-tot-asan-informational] almost always red
          • - login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational.
        • [ChromeOS Buildspec] red for M54 builds
          • - browser tests failing M54 builds on ChromeOS Buildspec builder. Landed a fix on the M54 branch that was made after the branch was cut, and was otherwise missed. For the builds to go green, we need a new M54 release though, since the builder pulls the current stable version release.
        • [Chrome4CROS Packages] always red
        • [lumpy-chrome-pfq] occasionally red
          • lumpy-chrome-pfq HWTest [bvt-inline] timed out waiting for json_dump. This is still happening, as the build time is too long occasionally. Added a note to the bug about certain tests taking much longer than the mean according to the gathered statistics when this occurs.
        •  Resolved Issues:
        • - Manually uprev Chrome to 56.8891.0.0 for Chrome OS. Since we otherwise would not have done so at all this week.
          • Actually there happened to be a green master run late Friday, for the first time in nine days.
        • - BuildPackages broken in multiple chrome-pfq builders. The CL for the fix landed and the builds were fixed Monday.
        • - (New) "Media.VideoCaptureGpuJpegDecoder.InitDecodeSuccess not loaded or histogram bucket not found or histogram bucket found at < 100%". Caused failures on peach-pit. The fix landed early Thursday.
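The security_SandboxedServices flake noted in the 10/10 New Issues (awk losing a race with an exiting process) points at a general rule: any /proc scan must tolerate PIDs vanishing between the listing and the open. A sketch; the `proc_root` parameter is ours, for testability:

```python
import os

def read_proc_statuses(proc_root="/proc"):
    """Collect the status files of all listed processes,
    tolerating the race where a process exits between the
    directory listing and the open() -- the exact failure
    mode behind the awk flake.
    """
    statuses = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                statuses[int(entry)] = f.read()
        except FileNotFoundError:
            pass  # process exited mid-scan; skip it, don't fail
    return statuses
```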

        10/3- 10/9
        Sheriffs: rajatja, denniskempin
        Gardeners: ihf, glevin
        • DebugSymbols error. Happens occasionally across boards:
        • AU Retry issues:
        • message_types_by_name error in dev_server:
        • buddy_release has been failing for weeks: need to investigate
        • gandof-release:
        • GSUtil timeout issues:
        • sentry-release: Some odd issues with HWTest need to investigate
        • bots failing graphics_Gbm check during hwtest

          PFQ (gardening) issues:
        •  New Issues:
        • - BuildPackages broken in multiple chrome-pfq builders.  There's a CL  for the fix, but it hasn't been committed yet.
        • - AboutTracingIntegrationTest.testBasicTraceRecording failing on x86-generic-telemetry and amd64-generic-telemetry.  CL to disable the test currently under review.
        • , , , - Autobugs for occasional HWTest provision flakes, mostly masked by 653900 since Thursday.
        • - falco- and tricky-chrome-pfq's failed w/timeouts during  Occasional flake, but no logs, no work done.
        • - lumpy-chrome-pfq HWTest [bvt-inline] timed out waiting for json_dump.  Flaked once, didn't recur.
        •  Ongoing Issues:
        • - Chrome4CROS Packages builder still broken (3+ weeks)
        • - Still happening on x86-generic-tot-asan-informational, with occasional successes slipping through.
        • - Occasional flake in PageLoadMetricsBrowserTest.FirstMeaningfulPaintNotRecorded
        • - HWTest[bvt-inline] : "security_NetworkListeners FAIL: Found unexpected network listeners".  Single flake, waiting to see if it recurs.
        •  Resolved Issues:
        • - [VMTest - SimpleTestVerify] failing on cyan-tot-chrome-pfq-informational : "Could not access KVM kernel module".  Reverted offending CL, builder green since then.
        • - Linux ChromiumOS Tests (dbg) failure of two DevToolsAgentTest.* tests.  Issue contains cause, revert, and subsequent fix.
        • - Linux ChromeOS Buildspec Tests failed intermittently for weeks.  Failure not seen since 10/7, when issue comment suggested that potential fix had landed.
        • - Multiple generic pfq builders failing with "Invalid ebuild name".  Fixed.

        9/26 - 10/2
        Sheriffs: dbasehore, akahuang
        Gardeners: jdufault, glevin

        9/19 - 9/25
        Sheriffs: apronin, charliemooney, vpalatin
        Gardeners: stevenjb
        • chromiumos-sdk failed to build (missing efi.h) - fixed; build CL at fault, CL to fix
        • Cyan has broken/flaky test performance in ToT, was causing CQ failures bug here
        • DataLinkManager crashing and breaking Canaries bug here (fixed: CL reverted)
        • Surfaceflinger crashing on oak bug here
        • Paladins fail to connect to MySQL instance bug here
        • Canaries were failing with "no attribute 'SignedJwtAssertionCredentials'" bug here (workaround CL submitted)
        • arc_mesa builds broken on auron, buddy, gandof, lulu, bug here, mostly fixed, buddy still fails as of buddy/428
        • manifest generation fails w/binary data in commit messages (e.g. CL:387905)
        • libmtp roll broke build packages due to autotools regen (fixed in CL:389031)
        • Root FS is over the limit for glimmer bug here
        • Reef builds were broken (unit tests failed to build), fixed here
        • Gru builds are broken (fail during uploading command stats) due to this CL, bug here, CL to fix
        • Some CLs are not marked as merged in Gerrit after a CQ run bug here
        • Tests that succeeded but left crashdumps frequently aborted on crashdump collection timeouts bug here, crashdump symbolication turned off if tests passed (here)
        PFQ (gardening) issues:
        • Chrome4CROS Packages builder failing in compile -
        • login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational -
        • login_OwnershipNotRetaken fails regularly on PFQ. -
          • Ongoing investigation
        • Shutdown crash in ~ScreenDimmer > SupervisedUserURLFilter::RemoveObserver -
          • Fixed
        • Several PFQ failures due to timeouts -
          • Some timeouts are triaged, but some still need investigation

        9/10 - 9/18
        Sheriffs: cernekee, kkunduru, chinyue

        9/5 - 9/9
        Sheriffs: jdiez, dhendrix, mcchou, josephsih
        Gardeners: achuith
        • Mostly having issues that affect many builders.
        • Canaries failing due to "HWTest did not complete due to infrastructure issues (code 3)", suspect b/31011610. May file more bugs...
        • Several builders failing due to misconfigured cheets_CTS test:
        • Kevin failing badly:
        • master-paladin infra failures (build 12292): this CL broke several paladin builds. Told the CL owner not to mark ready before fixing problems.
        • master-paladin infra failures (build 12294): failed 4 consecutive times. 20 paladins did not start in CommitQueueCompletion. Similar to build 12281 yesterday but build 12283 passed later.
        • provision_AutoUpdate.double ABORT: Timed out, did not run.
          • master-paladin infra failures (builds 12301, 12302): failed in these 2 builds
          • Looked similar to crbug/593423: Need to watch this as more builders were broken due to the timeout issue.
          • Build 12303 passed. Flaky?
        • signers failing while signing android apks:

        8/29 - 9/4
        Sheriffs: kitching, bleung, yixiang@
        Gardeners: michaelpg, afakhry
        • CQ paladin build #12207 failed due to whirlwind-paladin #5640 HWTest jetstream_ApiServerAttestation failing, but passes in #5641
        • CQ paladin build #12215 failed due to many repo sync errors (example: daisy_skate-paladin), looks like subsequent builds do not exhibit repo sync problems
        • CQ paladin build #12216 failed due to:
        • CQ paladin build #12218 failed due to "No room left in the flash". vpalatin knows about it and is looking for ways to make it fit.
        • - Slave frozen, needed to be restarted.
        • - Timeout on Paygen curl /list_suite_controls (auron-release)
        • - Timeout on Paygen curl /stage (banon-release)
        • - Paygen suite job timed out despite all PASSED
        • - buddy-release: Paygen suite job timed out, all tests FAILED/ABORT
        • Top Issue on 8/31 - - lab database problem
        • b/31011610 - ATL14 packet loss bringing down ChromeOS Commit Queue
        • - guado_moblab broken due to testing outage
        • -  nyan_freon-paladin timed out during p2p unittest
        • - gru-paladin attestation unittest failure. Possibly flaky test. apronin@ looking at fixing test. Also affects gale-paladin
        • - All paladins failed during CommitQueueSync. akeshet@'s theory is that a backlog of CLs (especially on the kernel repo) overwhelmed GoB. akeshet@ put in a CL to temporarily limit CQ volume to 50. TODO: Revert this once the backlog is cleared. nxia@ also added this mitigation:
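The 50-CL cap in the last bullet amounts to a batch limit ahead of CommitQueueSync; a hedged sketch (function and field names are ours, not cbuildbot's):

```python
def pick_cq_batch(ready_cls, max_cls=50):
    """Take at most `max_cls` of the oldest ready CLs into one
    CQ run, leaving the rest for the next run -- the temporary
    mitigation for the backlog that overwhelmed GoB.

    Each CL is represented as a dict with a 'ready_time' field
    (an ordering key); oldest-first keeps the queue fair.
    """
    ordered = sorted(ready_cls, key=lambda cl: cl["ready_time"])
    return ordered[:max_cls]
```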
        8/22 - 8/28
        Sheriffs: bhthompson, nya, walker
        Gardeners: jennyz, lpique
          8/15 - 8/21
          Sheriffs: benzh, sureshraj, yoshiki
          Gardeners: jamescook, domlaskowski
          • security_StatefulPermissions failures on canaries: 
          • provision_AutoUpdate.double failures on chrome pfq informational: 
          • SyncChrome failures due to "Repository does not yet have revision" on chrome informational pfq -> infra, ongoing flake
          • Chrome telemetry failures due to missing system salt file -> reverted
          • cyan chrome pfq informational builder cros-beefy191-c2 is out of disk space building chrome -> infra
          • pool: bvt, board: falco in a critical state -> infra
          • Chrome4CROS Packages builder failing in bot_update "fatal: reference is not a tree" -> infra
          • VMTest failing on telemetry bots due to telemetry_UnitTests_perf -> bug in test script?, disabled
          • cros amd64-generic Trusty builder failing to start goma in gclient runhooks step -> networking flake?
          • login_CryptohomeIncognito -> flaky, but real failure
          • cheets_NotificationTest failure on Cyan PFQ -> real failure in chrome (crash in shelf)
          • falco-full-compile-paladin has failed to start with exception setup_properties
          • x86-generic-tot-asan-informational failures in tpm_manager (odr-violation) and attestation (leaks) -> new target added to cros build that had failures, reverted
          • Kernel panics on Cyan PFQ -> ???
          • link-paladin BuildPackages failure with SSLError The read operation timed out
          • AUTest failed on most canaries due to no test configurations
          8/8 - 8/14
          Sheriffs: davidriley, vprupis, takaoka, smbarber (Mon afternoon only)
          • Continued UnitTest failures on canaries and release branches:
          • lakitu failures:
          • edgar missing duts:
          • kevin firmware prebuilt:
          • x86_alex and veyron_rialto pool health: and
          • Chumped change broke everything (eg pre-CQ, CQ, canaries) until revert was chumped in
          • infrastructure flake
            • celes-release/289, setzer-release/292 (build interrupted) ->
            • nyan-release/293, wolf-release/1294 (sudo access) ->
            • pre-cq (gerrit quota limits) ->
          • Friday: lab downtime affected builds for much of the day
          8/1 - 8/7
          Gardeners: stevenjb@, khmel@

          7/29 Notes for the next sheriffs from aaboagye, kirtika: 
          • Major issues we are seeing, format is <Impact: Issue: Links>:
            • Tree closure, fixed now: "No space left on device" for cheets builds: aaboagye@'s post-mortem here.
            • CQ failures: We've been seeing intermittent failures due to hitting git fetch limits with gerrit (commit queue sync step doesn't work). The current CQ run failed due to this, would not be surprised if the next one does too.
            • Several canaries failing: Unit-test times out, possibly due to overloaded machines:
            • Android-PFQ failures: adb is not ready in 60 seconds:
          • Minor issues, work-in-progress
            • Android-PFQ: mmap_min_addr not right on samus/x86:
            • Paygen/signing issues.
            • Autoupdate-rollback (likely network SSH issue): example
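Disk exhaustion ("No space left on device") keeps recurring in these reports (the cheets tree closure above, the cros-beefy191-c2 and samus-paladin entries). A minimal sketch of a pre-build disk check a builder could run to fail fast, using only the Python standard library (the function names are illustrative, not existing cbuildbot/chromite APIs):

```python
import shutil

def disk_usage_pct(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total

def has_enough_space(path="/", max_used_pct=90.0):
    """True if usage is at or below the threshold.

    A build step could check this up front and fail with a clear message
    instead of dying mid-build with ENOSPC.
    """
    return disk_usage_pct(path) <= max_used_pct
```

Surfacing a "disk nearly full" failure before BuildPackages starts would at least make these failures cheap and self-explanatory rather than a tree-closing mystery.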

          2016-07-25 thru 2016-07-29
          Sheriffs: aaboagye, kirtika, hidehiko (non-PST)

          • PST
            • Canaries
              • kevin-release was broken, but a fix is on the way. (wfrichar@ knows)
            • CQ
          • Non-PST:

          • PST
            • Canaries
              • Still seeing the error in the unittest phase. See
              • Paygen issue still affecting some canaries (x86_alex-he -
              • Saw a failure with auron_yuna canary with an error parsing a JSON response. See
              • samus failed with platform_OSLimits Found incorrect values: mmap_min_addr. Filed
            • CQ
              • Closed the tree because the CQ would just reject people's changes because of the no-disk-space error.
            • Chrome PFQ
              • Still seeing some failures in the login_CryptohomeIncognito test. See
          • Non-PST
            • CQ:
              • RED.
              • samus-paladin is failing due to no-disk-space error.
              • cheets tests are failing twice with an actual error ( Being fixed.
            • Chrome PFQ:
            • Android PFQ:

          • PST
            • Canaries
              • Seems like nearly all the canaries failed during the HWTest stage, apparently due to infra issues.
            • CQ
              • On one run, some of the paladins failed during the CommitQueueSync step due to git rate limiting.
            • Android PFQ
              • An overloaded devserver is causing provisioning to fail for cyan-cheets-android-pfq and veyron_minnie-android-pfq (wolf-tot-paladin too).
          • (Non-PST)
            • CQ:
              • Master paladin looks flaky for various reasons.
                • CQ limit hitting
                • HWtest time out
                • kOmahaErrorInHTTPResponse: there appears to be a tracking issue already.
              • These don't reproduce reliably, and some runs pass successfully.
            • Chrome PFQ:
              • Finally passed at #3175.
            • Android PFQ:
              • Failing in the latest several runs, though the reasons vary. Looks just too flaky.

          7/26 (18:20 PST)
          • Canary Failure Classification: Lots of canary failures (~50%) this afternoon, so listing unique causes here to track down tomorrow: 
            • x86-zgb: Pool-health issue, infra (kevcheng@) looking into it, may be back up next canary run? 
            • x86-mario: Not sure if the manifestversionedsync is a real issue or not, filed anyway. 
            • Paygen failures: falco, falco_li, gru, jecht, kip, lumpy, ninja, parrot, peppy, samus, smaug, x86_alex-he, stumpy. TBD: Update more details here. 

          • (PST)
            • Canaries
              • Still some errors on nyan_blaze and nyan_kitty caused by the vboot_firmware CL.
                • Fixes posted to gerrit and making their way through the CQ.
              • Still some unittest failures. There's a CL that just landed to reduce the parallelism. Will be following to see if the situation improves.
                • That CL did not seem to resolve the issues.
              • Saw a few canaries yesterday (celes this morning) that had issues when uploading debug symbols. dgarret@ is working on a fix.
              • security_StatefulPermissions is pretty flaky, veyron_minnie canary failing on it. wmatrix is all red: Investigating
              • There was canary failure on lars-release which reported all the DUTs in the pool as dead, but they seem to be up now.
              • x86-zgb pool health is poor - most devices down. kevcheng@ taking a look.
              • Towards the end of the day, a larger number of canaries were failing at the paygen step. This may be network flakiness, but I wonder why we don't just retry.
            • CQ
              • panther_embedded-minimal-paladin has been down for quite some time now. Pinged the bug to see if there are any updates.
                • A restart of the master has been scheduled. Need to check back later today if that fixes things.
              • No elm devices in pool:cq, making elm-paladin fail. kevcheng@ taking a look. No bug yet.
            • Android PFQ 
              • The harmony_java_math CTS test is causing android-pfq failures with "cts test does not exist". Filed b/30413761. Ping ihf@ if it doesn't get better.
            • Chrome PFQ 
          • (Non-PST)
            • Canaries
              • platform_FilePerms issue was fixed by yusukes@.
              • Investigated the UnitTest failure a bit more; root cause not yet found.
            • CQ
              • Looks flaky: sometimes failing with ErrorCode=37 (OmahaErrorInHTTPResponse).
            • Chrome PFQ:
              • Looks flaky. Sometimes failing due to a login error, but the failing boards vary.
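On the paygen retry question above: a retry wrapper with exponential backoff plus jitter is one common pattern for flaky network steps. A minimal sketch under that assumption (the wrapper and the `upload_payload` name are hypothetical, not existing chromite APIs):

```python
import random
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last failure
            # Sleep base_delay * 2^attempt, plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Hypothetical usage for a flaky upload step:
#   retry_with_backoff(lambda: upload_payload(board), attempts=3)
```

A couple of bounded retries would turn a transient network blip into a slow-but-green stage instead of a red canary, at the cost of delaying real failures by a few minutes.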

          • Canaries
            • Several of the canaries were failing in the platform_FilePerms HwTest.
              • This was seen on cyan, elm, lulu, oak, samus, and veyron_minnie.
              • Appears to be missing expectations for ARC containers.
              • Filed
            • The unittest stage seems to be timing out fairly often now.
            • nyan-big is failing on a vboot_firmware CL not building. Filed. Fix is in CQ now.
          • CQ 
            • Generally okay today. There was one issue regarding a failure in VMTest, but that was caught.

          2016-07-18 thru 2016-07-24
          Sheriff: wuchengli

          • 628990: DebugSymbolsUploadException: Failed to upload all symbol
          • 593461: Chrome failed to reach login screen within 120 seconds
          • 628494: chromeos-bootimage build failures in canary builds
          • 609931: 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-5:6:7:3, started)> for 8610 seconds
          • 629094: cannot find source stateful.tgz

          OLDER ENTRIES MOVED TO THE ARCHIVE so this page doesn't take forever to load.  See Sheriff Log: Chromium OS (ARCHIVE!)