Sheriff Log: Chromium OS (go/croslog)

Please update go/cros-sheriff-playbook when you find a build/infra failure and can map it to what action the sheriff should take for it.

Sheriffs: benchan, nsanders, hiroh

Ongoing issues
  • 784462: Provision failure spike in the lab
    • (Duplicated) 784222: PaygenTestDev failed on multiple canary builds
  • 784225: TestLabException: Not enough DUTs on Chrome-PFQ, Android-PFQ and canary build
  • 784686: veyron_rialto-paladin failed at BuildImage staging due to package: chromeos-base/telemetry
  • 786159: ImportError: No module named lockfile
  • 786159: HWTest failed due to INVALID_OPTIONS
  • 786159: AFE is down: google-sso enforced a new config requirement, breaking our apache servers
  • 786167: auto-update failed with StatefulUpdateError
  • 786395: CQ master failed to push a change with 'git log' errors
  • 786487: reef-uni-paladin failed due to no valid hosts for board:reef-uni
  • 785552: provision failures: DUT cannot recover from reboot at post check of rootfs update

Sheriffs: puthik, ddavenport, cywang

Resolved issues
  • 782509: video_ChromeHWDecodeUsed mse tests are failing because is broken down.
  • 781845: desktopui_ScreenLocker failing on amd64-generic and betty
  • 781302: slow queries on shards | chromeos-server98 and 104 tick rate is really low
  • 783312: video_ChromeHWDecodeUsed failing on tricky, caroline, lumpy, peppy
  • 781852: CQ failure when there are no CLs in the CQ run
  • 783449: unittest flake in autotest_lib.site_utils.lxc.container_pool.client_unittest.ClientTests.testConnection
Ongoing issues
  • 776997: cheets_StartAndroid.stress fails and chrome / kernel crashes
  • 783832: cheets_StartAndroid.stress timeout

Sheriffs: teravest, justincarlson, cywang
  • 782509: widespread "Media.GpuVideoDecoderInitializeStatus not loaded or histogram bucket not found or histogram bucket found at < 100%" - the root cause is "404 in". hiroh@ is helping to make a workaround to redirect requests to temporarily.
  • 782577: incorrect dependencies of media-libs/arc-camera3-libcamera_jpeg (Fixed)

Sheriffs: teravest, justincarlson, fukino
  • 777920: [kernel 3.18] veyron_speedy provision failure: USB enumeration of ethernet adapter fails with "can't set config #1, error -71"
  • 768542: DUT fails to bring up USB ethernet adapter after reboot in provision (chromeos kernel 4.4)
  • 779583: General Protection Fault in kernel-list_move_tail called from i915
    • Causes graphics_Idle failures
  • 780515: daisy_skate-release:1910 failed
    • Paygen failures
  • 780045: BuildPackages failing to build chromeos-chrome
    • This should be resolved, but keep an eye on the next goma update.
  • 780503: cave-release:1635 failed
  • 765686: wizpig-paladin Provision failed: Post-provision check for "system-services" being "start/running" can fail
    • This needs more attention and debugging.

Sheriffs: akahuang, jinsong, mruthven
  • 777250: HWTest failed to provision on peach_pit and veyron_minnie; let the Chrome gardener triage
  • 776919: lakitu-gpu, lakitu, lakitu paladin failed at build_package, should be fixed by CL:735061 and CL:737773
  • 766259: buildstart stage failing with IntegrityError, a flaky failure.
  • 777829: Most paladins raised exception "process killed by signal 9"

    Sheriffs: groeck, xiaochu, fukino, tetsui
    • 775872: M64: Cyan, Eve, Kefka, Samus build is RED for 4 days

    Sheriffs: jclinton, furquan, posciak
    • 773185: All Chrome PFQ bots failing starting from 63.0.3237.0 due to a syntax error in DEPS
    • 772568: lumpy, peppy, tricky Chrome PFQ failures in vmtest; manual uprev via 773446

    Sheriffs: ntang, djkurtz, phobbs
    • 771396: Lab DNS failure caused widespread master-paladin failure.
    • 771236: Provision failure due to version '9999'
    • 772582: Puppet run may interrupt the ssh_config and cause ssh connection failure.
    • 770778: A few cases of shard apache process death, which needs alerting.
    • 770865: Shard db inconsistent with master db causes shard_client crashloop
    • 770715: Quite a few graphics_drm failures (fixed).
      Sheriffs: chinyue, vbendeb, mxt
      • 769099: autotest-server & autotest-web-frontend circular dep
      • 769334: betty-arc64-paladin failed VMTest
      • 768280: build_image ran out of space

        Sheriffs: puneetster, amstan, 

        Sheriffs: yueherngl, seobrien, josephsih
          • 762579: p2p_ShareFiles fails - "Expected exported file ...". This fixed the flaky test failure of nyan_kitty-paladin.
          • 765145: CommitQueueCompletion: The master destructed itself and stopped waiting for the following slaves
            • A series of 7 such continuous CommitQueueCompletion errors.

          Sheriffs: jintao, tbroch, yamaguchi
          • 763290: MasterSlaveLKGMSync failed on 3 devices in nyc-android-pfq
          • 762883: veyron_mickey-release:1470 failed
          • 762865: crostestutils.au_test_harness.au_test.AUTest failing at SimpleTestVerify
          • 762826: HWTest [sanity] failing on Canary for many boards by "No JSON object could be decoded"
          • 762812: master-paladin failing by "losetup: could not find any free loop device"
          • 762525: AUTest flaky by autoupdate_EndToEndTest_npo_delta_9915.0.0 timeout/abort
          • 762400: autoupdate_EndToEndTest.paygen_au_dev_full flaky on multiple canary builds
          • 762393: banon-release:1466 failed: timeout reached running bvt-arc on 3 DUTS
          • b/65478658: gsutil flaking during upload with 412 Precondition Failed
          Sheriffs: mka, walker, kitching
          • 758247: Issue with EXPERIMENTAL keyword in tree status
          • 757510: veyron_tiger-android-pfq:608 failed: Multiple Android versions: set([u'4289255', '4287917']) - ignore, M builders will be removed eventually (b/64821099)
          • 755060: Eve trackpad FW regression possibly still causing issues with bvt DUT(s) in the lab (the DUT in question is working again)
          • 759450: HWTest timing out in relm-release builds
          • 759976: repeated ADB timeouts on eve-release
          • 759977: relm-release failing due to lack of BVT DUTs in lab (required: 4, found: 3)
          • 760011: enguarde-release failing due to lack of BVT DUTs in lab (required: 4, found: 3)
          • 760016: poppy-release builds failing due to swarming timeout
          • 760254: provision_FirmwareUpdate is devouring shard inodes
          • 760314: Chell PFQ build failed (gsutil failed)
          • 757943: Devservers are having trouble (intermittently?) resolving DUT names
          • 760739: Runaway processes on chromeos-skunk*
          • 760789: whirlwind: missing include file "ap-daemons/vorlon/client/dbus-proxies.h"
          • 760843: /b/c/cbuild/repository/chroot/usr/bin/xz: /lib/x86_64-linux-gnu/ version `XZ_5.2' not found
          • 761169: OperationalError: (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")
          • 761271: PaygenBuild* steps failing due to missing bspatch binary
          • 761422: Archive step failed: timeout while copying artifacts
          • 761513: bluestreak-pre-cq sanity build failure
          • 761471: Flaky cheets_StartAndroid.stress in veyron_minnie
          Sheriffs: briannorris, apronin
          • 757351: HWTest failed on most builds
          • 757510: veyron_tiger-android-pfq:608 failed: Multiple Android versions: set([u'4289255', '4287917'])
            • "Just ignore it"
          • 757599, 757658: moblab_RunSuite fails on guado_moblab-paladin
            • Bad CL landed in moblab tests last week. Reverted.
          • 757824: chromiumos-sdk failing to set up chroot
            • Reverted CL. Investigation continued on 757147
            • Investigation / re-landing to be continued on original bug: 756240
          • 757866 (-> 755914): testReapAllInOrder: AssertionError: False is not true
            • Relatively new, flaky test; failed at least twice on canaries today
          • 756957: amd64-generic-asan failing in build_packages
            • Autotest eclasses weren't packaging the tarball right; CL:634627 is in flight for fixing this
          • x86-generic - builders to be deleted from waterfall (729645):
            • The builders are to retire soon - ignore the errors.
            • 757929: x86-generic-full: build_packages: media-libs/mesa: econf failed: LLVM target 'amdgpu' not enabled
            • 757934: x86-generic-incremental: build_packages fails: Cannot find prebuilts for chrome
          • amd64-generic-tot-asan-informational failures:
            • 748216: bluez ASAN failures heap-buffer-overflow and global-buffer-overflow
            • 757921: cryptohome ASAN failure: detected memory leaks in libchaps (cryptohome_testrunner)
          • 757958: provision_Autoupdate.double failures on reef-paladin
          • 758036: nyan_kitty-paladin:2691 failed
          • 757943 (-> 712682): Devservers are having trouble (intermittently?) resolving DUT names
          • 758251: coral-paladin failing build-packages
            • bad chump. Reverted.
          • b/64842314: coral release builder failing firmware signing
          • b/65013136: guado_moblab-release fails paygen
          • 758665: login_MultipleSessions failed on link-paladin
          • 759039: cheets_PlayStoreTest fails with Unhandled TimeoutException: Timed out while waiting 20s for IsJavaScriptExpressionTrue.
            • Revert bad CL
          • 759093: CQ submitted a change via strategy:cq-submit-partial-pool-cq-history that broke HWTest on multiple platforms
            • CQ really shouldn't have accepted the bad CL from bug 759039

          Sheriffs: tfiga
          • 755051: CrOS Git commands fail because Git bundle is missing PERL regexp support. (e.g. PublishUprevChanges on master-paladin)
          • 755060: Eve bvt DUTs in the lab timing out
          • 755080: daisy_skate-release builders frequently hang in test server job (dup into 736393)
          • 755461: cheets_StartAndroid.stress (and other) tests failing on Intel boards due to graphics driver timeout when restarting chrome
          • 755470: lakitu-release/lakitu-gpu-release failing cloud_KonletStartup test
          • CL:608907: file collision due to missing blocker breaks CQ
          • CL:614310: ec-utils, coreboot breakage due to eclass change
          • 755699: peppy-paladin: desktopui_ExitOnSupervisedUserCrash: Timed out waiting for condition: Session stopped.
          • 755843: gandof-release: Tests failing likely due to DUT malfunction
          • 755882: reef-uni-release: Not enough DUTs for board: reef, pool: bvt-uni; required: 4, found: 3
          • 755906: canary: Big series of build timeouts
          • 755914: poppy-release: chromite UnitTests flaky?
          • 755917: poppy-release: chromeos-ec UnitTests failure (dup of 715011)

          Sheriffs: dbasehore, jkwang, hidehiko
          • b/62087733: daisy_spring devices in the lab are having issues
          • 751285: many paygentest[canary, dev] failures on specific boards like quawks
          • 751315: arc-networkd causing security_SandboxedServices failures on newbie
          • 751895: Issue causing moblab provisioning to fail
          • 752176: X11 libva file collisions breaking the build
          • CL:600543: Blew up the builder
          • 752656: Timeouts on AUTest for some canaries
          • 751762: Temp short cut for CtsAccountManagerTestCases
          • 752269: jetstream_BluetoothBeaconing failure
          • 752562: Temp fix for

          Sheriffs: wuchengli
          • 744212: autoupdate_EndToEndTest.paygen_au_dev_delta: Unhandled DevServerException: CrOS auto-update failed
          • 744569: Flaky PreCQLauncher GOBError
          • 743292: Cannot build gralloctest for eve-paladin
          • 746230: cheets_CTS_N.arm.CtsAppUsageHostTestCases failures
          • 746327: ASyncHWTest is forgiven
          • 746336: TestLabException: Not enough DUTs for board: zako
          • 746347: Sometimes autoupdate_Rollback needs retry to pass
          • 653048: cheets_KeyboardTest fails with "Timed out waiting for condition: Expected text entered"
          • 746548: Android boot test fails for strago boards
          • 746808: Paygen failure: CatFail AccessDeniedException 403
          • 746814: BuildPackages failed in chromeos-base/cryptohome.
          • 747211: Canary passed but no payload was produced
          • 747254: Chrome PFQ failing -- Wrong file version in settings_resources.pak
          • 747278: reks: SSHConnectionError: Connection to timed out

          Sheriffs: cernekee, sjg
          • turned out to be due to the toolchain binutils uprev from last week (e.g. breaking auron_paine). This was reverted and many builds became green.
          • Filed various bugs for broken builders. Seems like we should look into update_manager (e.g.
          6/19 ~6/23
          Sheriffs: kirtika, pmalani

          State of the test lab (6/19)
          • Chrome pinned due to a bad CL (not yet identified) that brought down kitty & blaze last week. Partly happened because we don't have any coverage for nyan on the PFQ. (
          • hwtest lab will be shut down tonight and tomorrow night.
          • kirtika@ to do a kernel merge (6/20 afternoon) - Intel wifi driver drop.  
          • Infra team: 4+ bugs to be filed for servo and repair (1 specific to servo on kitty, 3 others about how the servo repair process failed in the case of the kitty errors). 

            6/5 ~6/11
            Sheriffs: bleung, mcchou, mojahsu
            • 729766: autoupdate_EndToEndTest.paygen_au_canary_full, on a bunch of boards. Suspect AU from R53 8530.96.0
            • 730272: hostinfo attributes refer to incorrect job_repo_url, causing tests to fail
            • 731253: Skylake and Elm CQ failed due to drm patch from Intel. Error message is misleading, suggesting a failure in the infrastructure rather than a HWTest system that legitimately did not boot with the new image. was filed to improve the messaging.

            Sheriffs: benzh, vprupis, seanpaul
            • 727685: all bots failing in SyncChrome stage
            • 729016: many bots failing in TestSimpleChrome with clang++ bad indent
            Sheriffs: aaboagye, grundler, hashimoto
            • 720192: beaglebone-release is offline since Apr 20
            • 723645: CQ failures [elm, cave?]: HWTest failed due to RPC layer timeouts
              • Most recently seen on guado_moblab-paladin
            • 725152: ec-utils broken for bob
            • 708679: Some shards stop taking RPC calls.
              • This has manifested in boards appearing as "repair failed" (e.g. - buddy bvt DUTs)
            • 718083: x86-generic-asan builder has been failing. May need to be removed since x86 machines are EOL'd.
            • 677293: Re-occurrence of a failure when running `git clone`.
            • 725586: banjo, banon, cave, daisy_spring, parrot_ivb, and zako failed HWTest this morning due to not enough DUTs in the BVT pool.
            • 725856: authpolicy build failure when USE=-cros-debug (broke all release builders)
            • 726134: Many canaries failing with "SSL connection error: ASN: bad other signature confirmation"
            • 715011: nvmem ec test crashes resurfaces
            • 726383: bob-release failed cheets_StartAndroid
            • b/62087733: Broken daisy_spring DUTs.
            • 714330: guado_moblab-paladin fails a lot with "host did not return from reboot".
              • It has been moved to experimental, but still needs to be root caused.
            • 726757: Many canaries failing in chromeos-base/authpolicy unit test.
            • 726835: gale-release: BuildPackages failure in sys-boot/depthcharge
            Sheriffs: davidriley
            • 722599: provision failures due to bad prod push
            • 722961: cave (and celes/lars) HWTest stages slow to run, each test running slow
            • 723645: AFE RPC timeouts causing HWTest to fail
            • 723026: Chrome puppet roll out caused git failure
            • 723964: LXC artifact download issues
            Sheriffs: rbhagavatula, jwerner, hirono(non-PST):
            • 719342: guado-release has been broken for a month
            • 721855: Broken DUT chromeos2-row7-rack10-host11 caused elm-release failures
            • 719786: Seems that RootfsUpdateError and paygen issues on Braswell boards were caused by a kernel crash triggered by cbmem
            • 720087: AFE outage caused all paladin HWTests to time out
            • 720005: Swarming outage killed all HWTest with "Waiting for results from the following shards: 0 N/A: 3606683bbab9bf10 None"
            • 689105: Lots of autoupdate_EndToEndTest.paygen_au_* failures.
            • 717746: cheets_StartAndroid.stress Failure on bob-release/R60-9516.0.0. Tracking in the ARC++ issue entry.
            • 719342: security_AccountsBaseline consistently failed on guado since Apr 6

            Sheriffs: dnojiri, sduvvuri, hychao(non-PST):
            • 717061: some shards are missing lxc, failing ssp container setup
            • 716913: Upgrade openvpn package to v2.4.1
            • 718355: Packages failed in ./build_packages: chromeos-base/autotest-tests-cheets

            Sheriffs: chadversary, oliverwen, mnissler(non-PST):
            • 710492: TPM2 does not work inside VMTest: eve-pre-cq VMTests are failing, apparently unrelated to CLs being tested. CLs are blocked.
              • As potential fix, chumped CL:477090 eve: don't run VMTests in Pre-CQ
            • 715855: buildbot timeout in BuildPackages sentry-release on autotest-tests-cheets-0.0.1-r375.ebuild
            • 714571: HWtest invocation times out
            • 714598: Image signing step fails on release builders
            • 714601: logging_UserCrash fails on x86-generic-incremental
            • 714608: vmtest timeout on ASAN bots
            • 714451: sumo/ninja failing to sign due to maxcpus=2 [FIXED]
            • 715011: nvmem ec test crashes
            • 715066: HWtest failure return code -9 / code 247, but all tests pass
            • 715108: Build step failures (code 1) with missing AFE output
            • 715012: paladin failed HWTest stage due to post-suite JSON decode on chromite side | ValueError: No JSON object could be decoded
            • 716399: PFQ builders fail TestSimpleChromeWorkflow with "Could not run pkg-config."
            • 716412: UnitTest failures on amd64-generic-asan

            Sheriffs: tbroch, zhihongyu, owenlin(non-PST):

            • 697274: daisy_skate-chrome-pfq runs hwtest but the CQ doesn't. CL to remove hwtest from pfq
            • CQ:14360: arc-camera breakage -> reverts 14364
            • 713531: security_SandboxStatus fails.  Remove from bvt-inline for now
            • 713004: Tests passed but got aborted by AutotestAbort
              • 713856: network load likely suspect. 10x increase in file size since 4/12
              • 4/20: unpin chrome not clear it can be blamed
              • 4/19: pin chrome to 59.0.3064.0_rc-r1 as workaround
            • 713226: boost, python-gflags, other pkgs changed on mirror mismatch manifest.
            • 712679: canary builders failing : long build time for chromeos-chrome
            • 712297: chrome PFQ failures, goma enablement side-effect for TestSimpleChromeWorkflow
            • 712102: veyron_minnie-chrome-pfq bots look full-disk.
            • 712109: cyan-chrome-pfq is failing due to libstdc++ version mismatch.
            • 685889: (dup) veyron_mighty-paladin, winky-paladin failed due to (IntegrityError: "Duplicate entry for key 'buildbucket_id_index'")
            • 712505: Fizz Paladin: Failed steps failed cbuildbot: failed androidmetadata
            • 689105: /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
            • 697967: ASAN failures : no space left during build_images

              Sheriffs: shchen, philipchen, itspeter(non-PST):
              • 689105: /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
              • 708679: Some shards stop taking RPC calls
              • 708715: Frequent pre-cq failures on caroline.
              • 708429: 4/5: suspect pre-cq-launcher has a permission issue; closing the tree
              • Suspect CL:465488 breaks pre-cq-launcher #8925, revert and restart pre-cq.
              • 707696: itspeter@ believes Builder master-paladin Build #14172 is marked incorrectly. It should be green based on issue.
              • 707629: master-paladin failed continuously, suspect a slave is missing python package but not able to investigate further. Looks flaky as it passed on guado_moblab-paladin #5520
              Sheriffs: pberny, norvez, wnhuang (non-PST)
              • 689105 /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest
              • 703914 platform_MemCheck is flaky => flaky test
              • 699353 desktopui_ScreenLocker FAIL: Unhandled DevToolsClientUrlError => flaky test (Chrome crash)
              • 703789 graphics_Gbm: DUT Rebooted unexpectedly nyan_kitty. => flaky test
              • 690307 swap shard workload => Fixed
              • 704669 (resolved) Reef derivative canaries have broken linux firmware
              • 704381 "Report" build step doesn't time out refreshing access tokens for gsutil
              • 704194 (Two) Asuka devices not coming up after reboot during AUTest
              • 705247 android image signing failing due to out of space

              Sheriffs: smbarber, hoegsberg, shunhsingou (non-PST)
              • 701400 Repair flow no longer working for guado_moblab
              • 701693 SSH connection fails for veyron_speedy-paladin/veyron_mighty-paladin
              • 689105 /usr/bin/python: bad interpreter: No such file or directory in autoupdate_EndToEndTest

              Sheriffs: leecy, scollyer (PST), littlecvr (non-PST)
              • 699353 desktopui_ScreenLocker FAIL: Unhandled DevToolsClientUrlError
              • 698825 caroline gets canceled because the build takes too long to finish
              • 700021 gsutil issue in container setup causes "missing lockfile" failure
              • 695287 Slowness and 502 errors from cautotest AFE because of cautotest mysql slowness

              Sheriffs: moch, marcheu (PST)
              • 696606 devserver load may contribute to some provision failures
              • 696696 desktopui_MashLogin | FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
              • 698096 some canaries are running out of time
              • 694081 ARC availability check
              • 693610 tko_parser error
              • 694642 missing autoserv logs
              • 690822 CTS scheduling
              • 694755 chromeos.branch dying
              • 695172 cyan-chrome-pfq stuck
              • 695733 chrome re-pin
              • 695641 pre-cq-launcher failures due to oauth token invalidation
              • 695529 excessive provisioning errors
              • 696039 several jetstream flakes
              • 695940 kevin FW re-update
              • 639301 cyan stuck on shutdown
              Sheriffs: ejcaruso, mqg (PST), adurbin (non-PST)
              Infra: shuqianz
              • Generally swarming issues and network problems have been a huge problem this week.
              • reef, snappy, and pyro release builders were all marked important on 2/14
              • 693734 guado_moblab: AndroidMetadata failure; no ebuilds to satisfy "x11-base/xorg-server"
              • 693691 falco-release: suite timeouts (maybe network related? logs are bad, this is also happening on other boards)
              • 693597 nyan_kitty: CQ test failure
              • 693331 nyan_kitty: all CQ DUTs failed to provision
              • 693318 peppy: generic_RebootTest failure
              • 693313 breakpad compile failure from -Werror,-Winconsistent-missing-override
              • 693310 guado_moblab: broken CL made it past the CQ somehow
              • 693101 lab DHCP server configuration update took out the whole lab
              • 692342 kevin: provision failure loops (possible eMMC failures?)
              • 692236 falco_li: not enough DUTs to test canary
              • 692232 peppy: failed to provision
              • 692214 caroline: canary paygen issues
              • 692206 clapper: VMTest broken
              • 692129 snappy: no good repair build (unstable ToT)
              • 691729 kevin: unable to reach devserver
              • 690616 caroline: failed to perform stateful update (continued from last week)
              • 690286 reef: cs50-updater causing reboots and rollbacks (continued from last week)
              • 690232 candy: dbus issues causing canary failures (continued from last week)
              • 692240 setzer was moved between servers, resulting in some planned throttling
              Gardener: jennyz
              • 692247 falco-chrome-pfq, daisy_skate-chrome-pfq: failed to connect to DUT after AU
              • 687248 falco-chrome-pfq: flakiness in provisioning prevents chrome uprevs (continued from last week)

              Sheriffs: jinjingl, waihong
              • 691009 daisy_skate CQ: Devserver call failed: "" => Restarted devserver.
              • 690616 Coreline canary: Failed to perform stateful update
              • 690232 Candy: The name org.chromium.UpdateEngine was not provided by any .service files
              • 690286 no green build for reef family
              • 689794 samus-android-pfq failing HWTest - CrOS auto-update failed
              • 689694 CQ Failing Gerrit Unittests - gaierror => Fixed the test
              • 689105 multiple autoupdate_EndToEndTest failures at about 6:40 => Reverted CL
              • 689072 build_image failing again in canary archive step with cryptic error => Reverted CLs
              Gardeners: michaelpg, glevin
              • frequent falco-chrome-pfq failures. Suspect DUT replaced, but issue ongoing.
              • VideoPlayerBrowserTest.OpenSingleVideoOnDrive still flaky.
              • (fix in review): LKGM builder fails 50% of nights. Uploaded CL, ensured a run succeeded, and updated YAQS.
              • (resolved): depot_tools CL breaks SyncChrome step on canaries and PFQ; quickly reverted by dpranke@
              • (resolved): linker failure on amd64-generic-tot-asan-informational
              • (resolved): piex_loader.js is noisy in chromium browser_tests
              • various (resolved): flaky tests on Linux ChromiumOS. CLs reverted.
              • ketakid: PFQ failure for samus on 57 branch (fix here)
              Sheriffs: uekawa
               - stateful.tgz missing from caroline dev release builds. -- manually fixed
               - devserver down due to disk full, cleanup script wasn't running due to manifest. -- resolved and pushed.
               - dhcp outage caused lots of ssh connection timeout. -- should be resolved.
               - lakitu-paladin failing with GS upload failure. -- ACL was fixed.
               - lakitu-gpu-incremental has never succeeded -- a change went in.
               - falco-chrome-pfq failure. -- tried locking 
               - signers timing out
               - libinstallattributes failing with asan build. now fails with another failure.
               - seems to be failing all builders now with failing to uprev, what!?

              1/23 - 1/25
              Gardeners: jamescook, warx
              • 683977 git lockfiles breaking chromeos amd64-generic Trusty builder. Resolved.
              • 683640 FAIL: Test did not complete due to Chrome or ARC crash. Java version issue. Disabled.
              • 684044 "All devservers are currently down" - incorrectly blaming *all devservers* when a single devserver call flakes. WontFix.
              • 683304 falco-chrome-pfq failures. Infra / test problem. Fix in flight.
              • 674209 constant video_ChromeHWDecodeUsed failures in tricky/peach_pit informational pfq. Reverted.
              • 685313 linker failure on chromeos asan in libbrillo, "may overflow at runtime; recompile with -fPIC". Toolchain? Still failing.
              • 685340 chromeos Chrome LKGM builder failing in cros_best_revision, git cl land failure. Flaky. Infra?
              • 685424 scheduler: Aborting a large number of bvt-prebuild requests from past canaries causes slowdown, CQ failure. Ongoing.
              • 683828 Chrome compile failure, openh264 cpu architecture. Reverted.
              • 685675 Manually uprev Chrome to 58.0.2993.0 for Chrome OS
              • BuildPackages failure due to camera HALv2 autotest (not chrome), Reverted.
              • 685269 [VMTest] fails on cyan-tot-chrome-pfq-informational. Chrome / ARC incompatible. Fixed on ARC side.
              • 686193 amd64-generic-telemetry failure in vmtest telemetry_UnitTests SimpleTestVerify, PlayActionTest.testPlayWaitForPlayTimeout and webservd crash. Flaky.
              • 686265 Frequent exceptions (timeout) on Linux ChromiumOS Tests (dbg). Flaky.
              • 686266 Chrome OS PFQ annotator marks passing PFQ runs as failed if chrome didn't need to update. Tool issue.
              Sheriffs: abhishekbh, adlr, kcwu
              Infra: dshi

              Resolved issues:
              • outage for builders ('module' object has no attribute 'RetriableHttp')
              Sheriffs: snanda, rspangler
              Infra: dgarrett

              • canary paygen failures with "no JSON object could be decoded"
              •, build315-m2, build318-m2 can't sync (makes trybots somewhat unreliable)
              • [cyan-chrome-pfq] [veyron_minnie-chrome-pfq] failed HWTest [arc-bvt-cq] (swarming timeouts?)
              Not resolved, but not on fire either:
              • should not cause autotest to complain that it lost connectivity to the DUT
              • one unresponsive DUT caused CQ to fail (dup of
              • several builders failed HWTest with "Android did not boot!" errors (may have 2 different root causes)
              • failed with "Couldn't resolve host ''"
              • step failure should report root error (lots of GS failures on canaries Wed night)
              • space left on device for kefka-release during paygenbuild 
              • long delay between tests complete & stage end during bvt-inline stage
              • what was chromeos4-row4-rack12-host15 doing between 14:06 and 14:30?
              Resolved Issues:
              • reef-paladin is failing with out-of-space error in rootfs (temp workaround in place; testing longer-term fix)
              • unittest stuck for 25+ minutes (temporarily marked auron-paladin not important)
              • [bvt-cq] desktopui_MashLogin Failure on x86-alex-release/R57-9163.0.0 (root cause likely and; just disable that test on old platforms)
              • whirlwind paladins are failing: lack of healthy DUTs? (restarted scheduler)
              • paladin is failing HWTest; DUTs in repair failed state (fixed bad switch in lab)
              • 'class ash::ShelfWidget' has no member named 'SetShelfVisibility' (broke Chrome PFQ)
              • shard (chromeos-server42.cbf) is down
              • canaries keep failing at archive step

              Sheriffs: johnylin

              • some PFQ builders are timing out in HWTest (bvt-inline)
              • gale, whirlwind: BuildPackages fail (liblightcontrol make fail)
              • Not enough DUT for falco_li in lab (under request), see also b/33249596
              • asuka, auron-yuna, banon, celes, gandof, lulu, failed at Paygen stage with [Errno 28] No space left on device
              • chromeos-bmpblk broken for poppy --> poppy-paladin broken
              Resolved Issues:

              Sheriffs: dtor, martinroth, yhanada
              Infra: kevcheng

              Resolved Issues:

              Sheriffs: dianders, itspeter
              Infra: sbasi

              Resolved Issues:

              Sheriffs: sonnyrao, benchan, mtomasz
              Gardeners: stevenjb

                  Ongoing Issues:
              • fail gsutil uploads with AccessDeniedException 403

              • build 2448 failed due to provisioning error

              • failure in

              • tests failing in canaries with no individual test logs

              • rejecting manifest pushes ("failed to lock" error)

                  Resolved Issues:

              Sheriffs: jinsong, puthik, hungte

                  Ongoing Issues:
              • None

                  Resolved Issues:


              Sheriffs: mcchou, mruthven, ravisadineni

              Internal Waterfall:
                 Ongoing Issues:

              • Lumpy provision failed due to Unhandled DevServerException: CrOS auto-update failed for host chromeos6-row2-rack7-host12
                (b/33185795 is filed for tracking the offline status of this bot.)

              • Paygen issue on arkham-release builder seems to be a recurrence

              • lakitu-release builder GCEtest failed

              • guado_moblab-paladin failed at HWTest stage with moblab_RunSuite: FAIL: Unhandled AutoservRunError: command execution error

              • arkham-release builder failed at Paygen stage with cannot find source stateful.tgz error

              • veyron_speedy, wizpig failed at AUTest stage with image installation failure

              • and broke master-paladin

              • oak-release builder failed at HWTest stage with "(2006, 'MySQL server has gone away')" error

              • falco_li-release failed due to lack of DUT

              • sentry-release, Inconsistent propagation for the same test failures.

              • provision failure, Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack1-host21

              • falco-chrome-pfq failed due to network issue
                b/33249596 P0 filed for syslab to troubleshoot

              • build_packages error due to authpolicy on x86-generic

                 Resolved Issues:

              Public Waterfall:

                 Ongoing Issues:

              Sheriffs: drinkcat, groeck, furquan

                  Ongoing Issues:

              Please follow up on these, at least:
              • Lots of -paladin builder failures during ImageTest ( contains unsatisfied symbols). Had to pin Chrome.
              • VMTest in GCE instances?!
              • squawks pool:bvt unbalanced (please check what's going on?)
              • cros-beefy23-c2 out of disk space
              • guado_moblab: bad DUT
              • - Inadequate DUTs for falco_li
                • May not be fixed in the immediate future (we are short on HW)
              • - wizpig/terra-release builders fail during HWTest: An operational error occured during a database operation: (2006, 'MySQL server has gone away')
              Less critical:
                  Issues from last week:
              • - invalid oauth credentials. Some slaves were unable to retrieve images from google storage resulting in AUTest failures on the Canary waterfall.
              • - ssp picks random devserver.  Patches in place to mitigate.
                  Resolved Issues:
              • terra-release. Bad DUT
              • - kevin-tpm2 keeps failing (jwerner has a fix)
              • - wizpig-release HWTest has been failing continuously for a few days.
                • Bad DUT
              • - glados-release SignerTest failure (should be fixed)
              • - pool: bvt, board: x86-mario in critical state (should be fixed)
              • - x86-{mario/alex}-{paladin/release/chrome-pfq} failure (also seems to affect other x86 3.8 boards like peppy/falco/lumpy/etc)
              • - veyron_minnie-android-pfq not running (builder offline)
              • - sentry-release experiencing test timeouts (probably duplicate of 666070)
              • - cros-beefy70-c2: Disk almost full, glimmer-cheets-release Paygen failures
                  PFQ (gardening) issues:
              • None?
              Sheriffs: skau, ntang, pgeorgi
              Gardeners: jennyz

                  Ongoing Issues:
              • - invalid oauth credentials. Some slaves were unable to retrieve images from google storage resulting in AUTest failures on the Canary waterfall.
              • - Bad DUT for guado_moblab-paladin
              • - Inadequate DUTs for falco_li
              • - sentry-release experiencing test timeouts
              • - ssp picks random devserver.  Patches in place to mitigate.

                  Resolved Issues:
              • - x86-alex-paladin reports DUT unplugged. Actually, bad firmware CL in CQ.
              • - Not enough DUTs for buddy-release
              • - Lab restarted overnight. Caused 2 wedged slaves.
              • - Perceived lab slowness. Shard schedulers required restart.
              • - oak-paladin and reef-paladin failed due to bad restart of slaves
              • - peppy-release running client jobs as server jobs due to a bad image from devserver.
              • - No cyan boards for hw_video_acc_enc_vp8.  Misread debug message as error message.  Failure is expected.
              • - Multiple canaries failing due to overnight Ganeti restart.
              • - daisy_skate-paladins failing provision_AutoUpdate.double
                  PFQ (gardening) issues:
              • None?

              10/31- 11/06
              Sheriffs: tfiga, dlaurie, yueherngl, semenzato (honorary)
              Gardeners: jamescook

                  Ongoing Issues:

                • StageControlFileFailure due to DownloaderException
                • Canary runs fail with "DevServerException: stage_artifacts timed out"
                • related: Chrome LKGM is stale due to parrot-release failures
                • drone cannot connect to cloudSQL
                • login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational -> address space exhaustion on 32-bit Intel ASAN
                • b/32653128 - veyron_speedy-paladin constantly failing on an ARC++ related HWTest

                    Resolved Issues:

                  10/24- 10/31
                  Sheriffs: kirtika, mka, deanliao, semenzato (honorary)
                      Ongoing Issues on canaries
                  • SetupBoard failure, last ~10 parrot canaries failed. 
                  • Provision failure with error "Devserver portfile does not exist".
                  • AUTest fails with kOmahaErrorInHTTPResponse (37)
                  • No output from BackgroundTask for 8640 seconds
                  • To look into: guado paladin caused consecutive master paladin failures on Friday
                      PFQ (gardening) issues
                  • New issues:
                  • - Last AU on this DUT failed, The python interpreter is broken, completed successfully (happened once)
                  • - HWTest security_SandboxStatus failed twice on elm and veyron_mighty paladins.

                  • Ongoing Issues:
                  • - MobLab Failures in the CQ: dhcpd is not running. Crashing on shill restart (single occurrence)

                  • Resolved issues:
                  • b/32420834 -  Slow UI with 500 Internal Server Error on a CL with many comments (pre-cq-launcher failed to fetch the CL)

                  10/17- 10/23
                  Sheriffs: cychiang, briannorris, semenzato (honorary)
                  Gardeners: dshi, jrbarnette
                          Ongoing Issues on canaries:
                  • autoupdate_EndToEndTest, many different failures
                  • autoupdate_Rollback
                  • provision_Autoupdate.double
                  • other provisioning failures (rsync errors, timeouts, error 37)
                  PFQ (gardening) issues:
                  • New Issues:
                  • - lakitu cloud_SystemServices flakiness
                  • - autotest-web-tests build errors are too opaque
                    • Filed, noted a potential fix
                  • - Not enough falco_li DUT in the lab.
                  • - kunimitsu-release: build_packages failed on autotest-deps-ltp with undefined ltp_syscall, happened once.
                  • - guado_moblab-paladin: moblab_RunSuite: FAIL: Unhandled AttributeError: '_CrosVersionMap' object has no attribute 'get_stable_version'
                  • - celes-release, gandof-release: signing failed due to gsutil/ssl timeout
                  • - pre-cq failed because nyan_freon is removed
                  • - x86-mario-release: security_ModuleLocking timed out
                  • - Falco device chromeos2-row4-rack5-host7 is flaky in provision
                  • - multiple paladins: security_ptraceRestrictions: DUT rebooted during the test run.
                    • Caused by bad CLs that made it through for
                    • Poor Kernel 3.10 HW coverage:
                    • Bad CL in 3.10 has been reverted, but still flushing out of some canaries (2016-10-20)
                  • - Nearly all canaries failed: Paygen and AUTest fail to install device image.
                  • - chell signing/paygen failing due to new kernel cmdline flag
                  • - jetstream_LocalApi failure
                  • - wolf + veyron_speedy DUT availability
                  • - kunimitsu build failures
                    • Still not resolved; there's no paladin?
                  • Resolved Issues:
                  • - Chrome PFQ manifest errors
                    • Waiting for next PFQ runs to come through
                  • build_packages fails on almost all release builders and some paladin builders.
                  • - security_SandboxedServices failure "One or more processes failed sandboxing"
                  • - canary build failure because of minijail tree change. uprev of ebuild chumped. Fix to security_SandboxedServices chumped.
                  • - autotest-web-tests issues on guado_moblab-paladin (experimental)
                  • root caused to libcups/icedtea-bin - fix is in flight
                  • cave-release: Fail to resolve host name for cros-beefy19-c2
                  • b/32292437 - DUTs in pool crosperf are all 'repair failed'
                  • Need to push change to autotest shard.

                  10/10 - 10/16
                  Sheriffs: chirantan, julanhsu, kinaba
                  Gardeners: lpique, dbehr

                  PFQ (gardening) issues:
                  •  New Issues:
                  • - guado_moblab: Repair failing. Happened once, didn't reoccur
                  • - falco-chrome-pfq failing since build 4821 with apparent network issues after updating. Filed after digging into one of the failures on falco, and noticing that in one case the infra didn't reconnect to the DUT after it was provisioned. Possibly related to where falco becomes unpingable during provisioning.
                  • - select_to_speak exists build error. Occurred once.
                  • - Microcode SW error detected. Occurred once.
                  • - [bvt-inline] security_SandboxedServices failure on lumpy-chrome-pfq (flake). "awk cannot open /proc/xxx/status" because the process ended between when the filename was generated and when awk tried to open it.
                  •  Ongoing Issues:
                  • [falco-chrome-pfq] almost always red
                    • - provision failure "Device XXX is not pingable". This has plagued the falco-chrome-pfq builder, and is one of the main reasons we didn't automatically uprev Chrome this week.
                  • [x86-generic-tot-asan-informational] almost always red
                    • - login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational.
                  • [ChromeOS Buildspec] red for M54 builds
                    • - browser tests failing M54 builds on ChromeOS Buildspec builder. Landed a fix on the M54 branch that was made after the branch was cut, and was otherwise missed. For the builds to go green, we need a new M54 release though, since the builder pulls the current stable version release.
                  • [Chrome4CROS Packages] always red
                  • [lumpy-chrome-pfq] occasionally red
                    • lumpy-chrome-pfq HWTest [bvt-inline] timed out waiting for json_dump. This is still happening, as the build time is too long occasionally. Added a note to the bug about certain tests taking much longer than the mean according to the gathered statistics when this occurs.
                  •  Resolved Issues:
                  • - Manually uprevved Chrome to 56.8891.0.0 for Chrome OS, since we otherwise would not have done so at all this week.
                    • Actually there happened to be a green master run late Friday, for the first time in nine days.
                  • - BuildPackages broken in multiple chrome-pfq builders. The CL for the fix landed and the builds were fixed Monday.
                  • - (New) Media.VideoCaptureGpuJpegDecoder.InitDecodeSuccess not loaded or histogram bucket not found or histogram bucket found at < 100%". Caused failures on peach-pit. The fix landed early Thursday.

                  10/3- 10/9
                  Sheriffs: rajatja, denniskempin
                  Gardeners: ihf, glevin
                  • DebugSymbols error. Happens occasionally across boards:
                  • AU Retry issues:
                  • message_types_by_name error in dev_server:
                  • buddy-release has been failing for weeks: need to investigate
                  • gandof-release:
                  • GSUtil timeout issues:
                  • sentry-release: Some odd issues with HWTest need to investigate
                  • bots failing graphics_Gbm check during hwtest

                    PFQ (gardening) issues:
                  •  New Issues:
                  • - BuildPackages broken in multiple chrome-pfq builders.  There's a CL  for the fix, but it hasn't been committed yet.
                  • - AboutTracingIntegrationTest.testBasicTraceRecording failing on x86-generic-telemetry and amd64-generic-telemetry.  CL to disable the test currently under review.
                  • , , , - Autobugs for occasional HWTest provision flakes, mostly masked by 653900 since Thursday.
                  • - falco- and tricky-chrome-pfq's failed w/timeouts during  Occasional flake, but no logs, no work done.
                  • - lumpy-chrome-pfq HWTest [bvt-inline] timed out waiting for json_dump.  Flaked once, didn't recur.
                  •  Ongoing Issues:
                  • - Chrome4CROS Packages builder still broken (3+ weeks)
                  • - Still happening on x86-generic-tot-asan-informational, with occasional successes slipping through.
                  • - Occasional flake in PageLoadMetricsBrowserTest.FirstMeaningfulPaintNotRecorded
                  • - HWTest[bvt-inline] : "security_NetworkListeners FAIL: Found unexpected network listeners".  Single flake, waiting to see if it recurs.
                  •  Resolved Issues:
                  • - [VMTest - SimpleTestVerify] failing on cyan-tot-chrome-pfq-informational : "Could not access KVM kernel module".  Reverted offending CL, builder green since then.
                  • - Linux ChromiumOS Tests (dbg) failure of two DevToolsAgentTest.* tests.  Issue contains cause, revert, and subsequent fix.
                  • - Linux ChromeOS Buildspec Tests failed intermittently for weeks.  Failure not seen since 10/7, when issue comment suggested that potential fix had landed.
                  • - Multiple generic pfq builders failing with "Invalid ebuild name".  Fixed.

                  9/26 - 10/2
                  Sheriffs: dbasehore, akahuang
                  Gardeners: jdufault, glevin

                  9/19 - 9/25
                  Sheriffs: apronin, charliemooney, vpalatin
                  Gardeners: stevenjb
                  • chromiumos-sdk failed to build (missing efi.h) - fixed: build, CL at fault, CL to fix
                  • Cyan has broken/flaky test performance in ToT, was causing CQ failures bug here
                  • DataLinkManager crashing and breaking Canaries bug here (fixed: CL reverted)
                  • Surfaceflinger crashing on oak bug here
                  • Paladins fail to connect to MySQL instance bug here
                  • Canaries were failing with "no attribute 'SignedJwtAssertionCredentials'" bug here (workaround CL submitted)
                  • arc_mesa builds broken on auron, buddy, gandof, lulu, bug here, mostly fixed, buddy still fails as of buddy/428
                  • manifest generation fails w/binary data in commit messages (e.g. CL:387905)
                  • libmtp roll broke build packages due to autotools regen (fixed in CL:389031)
                  • Root FS is over the limit for glimmer bug here
                  • Reef builds were broken (unit tests failed to build), fixed here
                  • Gru builds are broken (fail during uploading command stats) due to this CL, bug here, CL to fix
                  • Some CLs are not marked as merged in Gerrit after a CQ run bug here
                  • Tests that succeeded but left crashdumps frequently aborted on crashdump collection timeouts bug here, crashdump symbolication turned off if tests passed (here)
                  PFQ (gardening) issues:
                  • Chrome4CROS Packages builder failing in compile -
                  • login_Cryptohome fails nearly constantly on x86-generic-tot-asan-informational -
                  • login_OwnershipNotRetaken fails regularly on PFQ. -
                    • Ongoing investigation
                  • Shutdown crash in ~ScreenDimmer > SupervisedUserURLFilter::RemoveObserver -
                    • Fixed
                  • Several PFQ failures due to timeouts -
                    • Some timeouts are triaged, but some still need investigation

                  9/10 - 9/18
                  Sheriffs: cernekee, kkunduru, chinyue

                  9/5 - 9/9
                  Sheriffs: jdiez, dhendrix, mcchou, josephsih
                  Gardeners: achuith
                  • Mostly having issues that affect many builders.
                  • Canaries failing due to "HWTest did not complete due to infrastructure issues (code 3)", suspect b/31011610. May file more bugs...
                  • Several builders failing due to misconfigured cheets_CTS test:
                  • Kevin failing badly:
                  • master-paladin infra failures (build 12292): this CL broke several paladin builds. Told the CL owner not to mark ready before fixing problems.
                  • master-paladin infra failures (build 12294): failed 4 consecutive times. 20 paladins did not start in CommitQueueCompletion. Similar to build 12281 yesterday but build 12283 passed later.
                  • provision_AutoUpdate.double ABORT: Timed out, did not run.
                    • master-paladin infra failures (builds 12301, 12302): failed in these 2 builds
                    • Looked similar to crbug/593423: Need to watch this as more builders were broken due to the timeout issue.
                    • Build 12303 passed. Flaky?
                  • signers failing while signing android apks:

                  8/29 - 9/4
                  Sheriffs: kitching, bleung, yixiang@
                  Gardeners: michaelpg, afakhry
                  • CQ paladin build #12207 failed due to whirlwind-paladin #5640 HWTest jetstream_ApiServerAttestation failing, but passes in #5641
                  • CQ paladin build #12215 failed due to many repo sync errors (example: daisy_skate-paladin), looks like subsequent builds do not exhibit repo sync problems
                  • CQ paladin build #12216 failed due to:
                  • CQ paladin build #12218 failed due to "No room left in the flash". vpalatin@ knows about it and is looking for ways to make it fit.
                  • - Slave frozen, needed to be restarted.
                  • - Timeout on Paygen curl /list_suite_controls (auron-release)
                  • - Timeout on Paygen curl /stage (banon-release)
                  • - Paygen suite job timed out despite all PASSED
                  • - buddy-release: Paygen suite job timed out, all tests FAILED/ABORT
                  • Top Issue on 8/31 - - lab database problem
                  • b/31011610 - ATL14 packet loss bringing down ChromeOS Commit Queue
                  • - guado_moblab broken due to testing outage
                  • -  nyan_freon-paladin timed out during p2p unittest
                  • - gru-paladin attestation unittest failure. Possibly flaky test. apronin@ looking at fixing test. Also affects gale-paladin
                  • - All paladins failed during CommitQueueSync.  akeshet@ theory is that backlog of CLs (especially on kernel repo) overwhelmed GoB. akeshet@ put in a CL to temporarily limit CQ volume to 50 : TODO: Revert this once the backlog is cleared. nxia@ also added this mitigation :
                  8/22 - 8/28
                  Sheriffs: bhthompson, nya, walker
                  Gardeners: jennyz, lpique
                    8/15 - 8/21
                    Sheriffs: benzh, sureshraj, yoshiki
                    Gardeners: jamescook, domlaskowski
                    • security_StatefulPermissions failures on canaries: 
                    • provision_AutoUpdate.double failures on chrome pfq informational: 
                    • SyncChrome failures due to "Repository does not yet have revision" on chrome informational pfq -> infra, ongoing flake
                    • Chrome telemetry failures due to missing system salt file -> reverted
                    • cyan chrome pfq informational builder cros-beefy191-c2 is out of disk space building chrome -> infra
                    • pool: bvt, board: falco in a critical state -> infra
                    • Chrome4CROS Packages builder failing in bot_update "fatal: reference is not a tree" -> infra
                    • VMTest failing on telemetry bots due to telemetry_UnitTests_perf -> bug in test script?, disabled
                    • cros amd64-generic Trusty builder failing to start goma in gclient runhooks step -> networking flake?
                    • login_CryptohomeIncognito -> flaky, but real failure
                    • cheets_NotificationTest failure on Cyan PFQ -> real failure in chrome (crash in shelf)
                    • falco-full-compile-paladin has failed to start with exception setup_properties
                    • x86-generic-tot-asan-informational failures in tpm_manager (odr-violation) and attestation (leaks) -> new target added to cros build that had failures, reverted
                    • Kernel panics on Cyan PFQ -> ???
                    • link-paladin BuildPackages failure with SSLError The read operation timed out
                    • AUTest failed on most canaries due to no test configurations
                    8/8 - 8/14
                    Sheriffs: davidriley, vprupis, takaoka, smbarber (Mon afternoon only)
                    • Continued UnitTest failures on canaries and release branches:
                    • lakitu failures:
                    • edgar missing duts:
                    • kevin firmware prebuilt:
                    • x86_alex and veyron_rialto pool health: and
                    • Chumped change broke everything (eg pre-CQ, CQ, canaries) until revert was chumped in
                    • infrastructure flake
                      • celes-release/289, setzer-release/292 (build interrupted) ->
                      • nyan-release/293, wolf-release/1294 (sudo access) ->
                      • pre-cq (gerrit quota limits) ->
                    • Friday: lab downtime affected builds for much of the day
                    8/1 - 8/8
                    Gardeners: stevenjb@, khmel@

                    7/29 Notes for the next sheriffs from aaboagye, kirtika: 
                    • Major issues we are seeing, format is <Impact: Issue: Links>:
                      • Tree closure, fixed now: "No space left on device" for cheets builds: aaboagye@'s post-mortem here.
                      • CQ failures: We've been seeing intermittent failures due to hitting git fetch limits with gerrit (commit queue sync step doesn't work). The current CQ run failed due to this, would not be surprised if the next one does too.
                      • Several canaries failing: Unit-test times out, possibly due to overloaded machines:
                      • Android-PFQ failures: adb is not ready in 60 seconds:
                    • Minor issues, work-in-progress
                      • Android-PFQ: mmap_min_addr not right on samus/x86:
                      • Paygen/signing issues.
                      • Autoupdate-rollback (likely network SSH issue): example

                    2016-07-25 thru 2016-07-29
                    Sheriff: aaboagye, kirtika, hidehiko (non-PST)

                    • PST
                      • Canaries
                        • kevin-release was broken, but a fix is on the way. (wfrichar@ knows)
                      • CQ
                    • Non-PST:

                    • PST
                      • Canaries
                        • Still seeing the error in the unittest phase. See
                        • Paygen issue still affecting some canaries (x86_alex-he -
                        • Saw a failure with auron_yuna canary with an error parsing a JSON response. See
                        • samus failed with platform_OSLimits Found incorrect values: mmap_min_addr. Filed
                      • CQ
                        • Closed the tree because the CQ would just reject people's changes because of the no-disk-space error.
                      • Chrome PFQ
                        • Still seeing some failures in the login_CryptohomeIncognito test. See
                    • Non-PST
                      • CQ:
                        • RED.
                        • samus-paladin is failing due to no-disk-space error.
                        • cheets tests are failing two times with actual error ( Being fixed.
                      • Chrome PFQ:
                      • Android PFQ:

                    • PST
                      • Canaries
                        • Seems like nearly all the canaries failed during HWTest stage apparently due to Infra issues.
                      • CQ
                        • On one run, some of the paladins failed during the CommitQueueSync step due to git rate limiting.
                      • Android PFQ
                        • An overloaded devserver is causing provisioning to fail for cyan-cheets-android-pfq and veyron_minnie-android-pfq (wolf-tot-paladin too).
                    • (Non-PST)
                      • CQ:
                        • Master paladin looks flaky due to various reasons.
                          • CQ limit hitting
                          • HWtest time out
                          • kOmahaErrorInHTTPResponse: looks like a tracking issue.
                        • These do not look consistently reproducible, and some runs pass successfully.
                      • Chrome PFQ:
                        • Finally passed at #3175.
                      • Android PFQ:
                        • Failing in the latest several runs, though for a variety of reasons. Looks just too flaky.

                    7/26 (18:20 PST)
                    • Canary Failure Classification: Lots of canary failures (~50%) this afternoon, so listing unique causes here to track down tomorrow: 
                      • x86-zgb: Pool-health issue, infra (kevcheng@) looking into it, may be back up next canary run? 
                      • x86-mario: Not sure if the manifestversionedsync is a real issue or not, filed anyway. 
                      • Paygen failures: falco, falco_li, gru, jecht, kip, lumpy, ninja, parrot, peppy, samus, smaug, x86_alex-he, stumpy. TBD: Update more details here. 

                    • (PST)
                      • Canaries
                        • Still some errors on nyan_blaze and nyan_kitty caused by the vboot_firmware CL.
                          • Fixes posted to gerrit and making their way through the CQ.
                        • Still some unittest failures. There's a CL that just landed to reduce the parallelism. Will be following to see if the situation improves.
                          • That CL did not seem to resolve the issues.
                        • Saw a few canaries yesterday (celes this morning) that had issues when uploading debug symbols. dgarret@ is working on a fix.
                        • security_StatefulPermissions is pretty flaky, veyron_minnie canary failing on it. wmatrix is all red: Investigating
                        • There was canary failure on lars-release which reported all the DUTs in the pool as dead, but they seem to be up now.
                        • x86-zgb pool health is poor - most devices down. kevcheng@ taking a look.
                        • Towards the end of the day, a larger number of canaries were failing at the paygen step. I think what may be happening is network flakiness, but I wonder why we don't just retry again?
                      • CQ
                        • panther_embedded-minimal-paladin has been down for quite some time now. Pinged the bug to see if there are any updates.
                          • A restart of the master has been scheduled. Need to check back later today if that fixes things.
                        • No elm devices in pool:cq making elm-paladin fail. kevcheng@ taking a look. No bug yet. 
                      • Android PFQ 
                        • harmony_java_math CTS test is causing android-pfq failures with "cts test does not exist".  Filed b/30413761. Ping ihf@ if it doesn't get better.
                      • Chrome PFQ 
                    • (Non-PST)
                      • Canaries
                        • platform_FilePerms issue was fixed by yusukes@.
                        • Investigated the UnitTest failure a bit more. Have not yet reached the root cause.
                      • CQ
                        • Looks flaky: Sometimes failing with ErrorCode=37 (OmahaErrorInHTTPResponse).
                      • Chrome PFQ:
                        • Looks flaky. Sometimes failing due to login error, but the failing boards vary.

                    • Canaries
                      • Several of the canaries were failing in the platform_FilePerms HWTest.
                        • This was seen on cyan, elm, lulu, oak, samus, and veyron_minnie.
                        • Appears to be missing expectations for ARC containers.
                        • Filed
                      • The unittest stage seems to be timing out fairly often now.
                      • nyan-big is failing on a vboot_firmware CL not building. Filed Fix is in CQ now. 
                    • CQ 
                      • Generally okay today. There was one issue regarding a failure in VMTest, but that was caught.

                    2016-07-18 thru 2016-07-24
                    Sheriff: wuchengli

                    • 628990: DebugSymbolsUploadException: Failed to upload all symbol
                    • 593461: Chrome failed to reach login screen within 120 seconds
                    • 628494: chromeos-bootimage build failures in canary builds
                    • 609931: 'chromite.lib.parallel.ProcessSilentTimeout'>: No output from <_BackgroundTask(_BackgroundTask-5:6:7:3, started)> for 8610 seconds
                    • 629094: cannot find source stateful.tgz

                    OLDER ENTRIES MOVED TO THE ARCHIVE so this page doesn't take forever to load.  See Sheriff Log: Chromium OS (ARCHIVE!)