Autotest Test Scheduler V2

We’ve recently added dynamically-defined test suites for Chrome OS, but can currently only run them synchronously from our build slaves. That works great for a BVT suite, but is less than ideal for longer-running suites, suites whose results are not needed immediately, and suites that need to be run on a particular build on-demand.
The scheduler must be able to:
There are two conceptual classes of tasks in the scheduler: timed-event tasks and build-event tasks.
On each event, the scheduler will wake up and, for each known platform, ask the dev server for the name of the most-recent staged build. It will then run through all the tasks configured to run on this event and query the AFE to see if a job already exists for this (platform, build, suite) tuple. We’ll facilitate this by having a convention for naming jobs, e.g. stumpy-release/R19-1998.0.0-a1-b1135-test_suites/control.bvt for a nightly-triggered regression test of the stumpy build ‘R19-1998.0.0-a1-b1135’. If an appropriate job does not exist, the scheduler will fire-and-forget one.
This de-duping logic allows us to avoid a lot of complexity; we can simply throw tasks on the to-be-scheduled queue in the scheduler, and trust that they’ll get tossed out by this check if they would duplicate work.
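The de-duping check can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: `FakeAFE`, `job_name`, and `schedule_if_new` are hypothetical names, the in-memory set stands in for a real AFE query, and the fixed `-release` build type is an assumption taken from the single example name above.

```python
def job_name(platform, build, suite):
    """Build a job name following the naming convention in the example.

    Assumption: the build type is always 'release', as in
    'stumpy-release/R19-1998.0.0-a1-b1135-test_suites/control.bvt'.
    """
    return f'{platform}-release/{build}-test_suites/control.{suite}'


class FakeAFE:
    """Hypothetical in-memory stand-in for the AFE (not the real AFE API)."""

    def __init__(self):
        self._jobs = set()

    def job_exists(self, name):
        return name in self._jobs

    def create_job(self, name):
        self._jobs.add(name)


def schedule_if_new(afe, platform, build, suite):
    """Create a job unless one already exists for this tuple.

    Returns True if a job was created, False if it was a duplicate.
    """
    name = job_name(platform, build, suite)
    if afe.job_exists(name):
        return False  # Duplicate work: toss it out.
    afe.create_job(name)  # Fire-and-forget.
    return True
```

With this shape, callers can enqueue every candidate (platform, build, suite) tuple without tracking what has already run; the second call for the same tuple is simply a no-op.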
As the suite definition files themselves currently handle sharding, we don’t need to take it into account here.
We support a couple of symbolic branch names:
We also support release branch cutoffs. If a Task can’t be run on branches earlier than R18, say, we allow you to specify ‘branch: >=R18’ in your task config.
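One way the cutoff check might look is sketched below. The function name `branch_allowed` is hypothetical. Only the ‘>=’ form appears in this document; treating a bare ‘R18’ as an exact match is an assumption made for the sketch.

```python
import re

# Assumption: specs look like '>=R18' or 'R18'; branches look like 'R19'.
_SPEC = re.compile(r'^(>=)?R(\d+)$')
_BRANCH = re.compile(r'^R(\d+)$')


def branch_allowed(spec, branch):
    """Return True if release `branch` satisfies the cutoff `spec`."""
    spec_match = _SPEC.match(spec.strip())
    branch_match = _BRANCH.match(branch.strip())
    if spec_match is None:
        raise ValueError(f'unrecognized branch spec: {spec!r}')
    if branch_match is None:
        raise ValueError(f'unrecognized branch name: {branch!r}')

    cutoff = int(spec_match.group(2))
    number = int(branch_match.group(1))
    if spec_match.group(1) == '>=':
        return number >= cutoff
    return number == cutoff  # Bare 'R18': exact match (assumed semantics).
```

Comparing the milestone numbers as integers, rather than as strings, keeps the check correct once milestones reach three digits.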
We’ll start with just ‘nightly’ and ‘weekly’. The time/day of these events can be configured at startup.
When a timer fires, the scheduler will enumerate the supported platforms by querying the AFE. Then, for each platform and branch, we can glob through Google Storage to find the latest build (this requires that we put factory/firmware builds in appropriately-named buckets in GS). Now, we can go through all triggered tasks and schedule their associated suites. This may include some already-run suites, but they will be thrown out by our de-duping logic.
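Picking the latest build out of a listing could be sketched like this. The listing itself (the Google Storage glob) is left out; the function takes plain build-name strings. The name pattern is inferred from the single example ‘R19-1998.0.0-a1-b1135’, and ordering builds by their numeric fields from left to right is an assumption.

```python
import re

# Assumption: build names follow 'R<milestone>-<x>.<y>.<z>-a<n>-b<n>'.
_BUILD = re.compile(r'^R(\d+)-(\d+)\.(\d+)\.(\d+)-a(\d+)-b(\d+)$')


def build_key(name):
    """Return a sortable tuple of ints for a build name, or None."""
    match = _BUILD.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def latest_build(names):
    """Return the newest build name in `names`, or None if none parse."""
    keyed = [(build_key(name), name) for name in names]
    keyed = [(key, name) for key, name in keyed if key is not None]
    if not keyed:
        return None
    # Numeric comparison, so 'R19-999...' sorts before 'R19-1998...'.
    return max(keyed)[1]
```

Names that do not match the pattern are skipped rather than raising, on the assumption that a bucket may contain unrelated entries.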
Initially, we’ll support two events: ‘new-build’ and ‘new-chrome’. The scheduler will poll git, looking for updates to the manifest-versions repo(s) and the chrome ebuilds.
Every time a new build completes, a manifest is published to the manifest-versions git repo (internal or external, depending on the build). From the path at which that file is checked in, we can determine what target it was built for and what the name of the build is. From the contents of the file, we can determine what branch the build was on. With all that info, we can look through the configured tasks and figure out which ones to schedule.
When we see that a Chrome rev has occurred, we can register each ‘new-chrome’-triggered task to run on the next new build for all supported platforms.
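The deferral described here might be tracked with a small per-platform pending list, sketched below. `ChromeTriggerTracker` and its method names are hypothetical, and tasks are represented as plain strings for illustration.

```python
class ChromeTriggerTracker:
    """Holds 'new-chrome' tasks until the next new build per platform."""

    def __init__(self, platforms):
        self._pending = {platform: [] for platform in platforms}

    def on_new_chrome(self, tasks):
        """Register each task to run on the next build of every platform."""
        for pending in self._pending.values():
            for task in tasks:
                if task not in pending:  # Avoid double registration.
                    pending.append(task)

    def on_new_build(self, platform):
        """Return, and clear, the tasks waiting on this platform."""
        tasks = self._pending.get(platform)
        if tasks is None:
            return []  # Unknown platform: nothing was registered.
        self._pending[platform] = []
        return tasks
```

A usage example: after `on_new_chrome(['chrome-perf'])`, the first `on_new_build('stumpy')` returns `['chrome-perf']` and later builds of that platform return an empty list until Chrome revs again.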
We will use python config files.
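As a rough illustration, a task definition might be loaded as shown below. This assumes that "python config files" means the INI-style format read by the standard `configparser` module. The section name `NightlyPower`, the `run_on` and `suite` keys, and the `power` suite are hypothetical; only the ‘branch: >=R18’ form comes from this document.

```python
import configparser

# Hypothetical task definition; key names other than 'branch' are assumed.
EXAMPLE_CONFIG = """
[NightlyPower]
run_on: nightly
suite: power
branch: >=R18
"""


def load_tasks(text):
    """Parse task sections from config text into a list of dicts."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    tasks = []
    for name in parser.sections():
        section = parser[name]
        tasks.append({
            'name': name,
            'run_on': section['run_on'],      # Required: triggering event.
            'suite': section['suite'],        # Required: suite to run.
            'branch': section.get('branch'),  # Optional: None if absent.
        })
    return tasks
```

One section per task keeps each task's trigger, suite, and branch constraints in one place, and a missing required key fails loudly with a `KeyError` at load time.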
We considered using buildbot in a couple of different ways, but ultimately decided against it. The largest issue here is that the builders all live in the golo, and the lab does not (and will not). We currently cope with this for the sake of HWTest because we need to return results to the build infrastructure. We prefer to avoid adding further dependencies on the ability to talk from golo <-> corp unless it is strictly necessary to achieve desired functionality. It’s certainly not necessary to add this complexity from the point of view of correctly scheduling tests.
One proposed approach would use build slaves that trigger when the real builders finish, to handle event-triggered suites, plus one that kicks off at the same time every day to run timer-based suites. The issues are these:
Another approach would have us setting up a slave for every suite/trigger/build combo. This has maintenance headaches similar to the above.
This would also require communicating between two disconnected networks, so we decided against it as well.