OPINION EXPRESSED IN THIS DOCUMENT ARE THE VIEW OF MARC-ANTOINE RUEL AND ARE NOT REPRESENTATIVE OF THE VIEW OF HIS EMPLOYER, GOOGLE INC. </blink>
Initially presented at agilis.is conference in Reykjavik, Iceland on 2011-11-14 by Marc-Antoine Ruel. This document is a flattened version of the presentation done there.
As a team, we provide the infrastructure necessary to enable the current release process, which I'll describe during the presentation.
You cannot approach client project infrastructure the same way you do for web services. The deployment cost is largely different. The feedback loop, to know what's happening on the user's machine, has much higher latency.
Most web services and web sites use really fast release cycles. I've heard some big site having a 6 months release cycle but that's not the norm. And I presume if you are here today, you don't want to be on the laggard side. I'm going to focus this presentation on client side which has totally different and interesting challenges. We wanted Google Chrome to feel like the web and not like shrink wrapped software. My talk is focusing on techniques used to enable this vision.
Reduce friction for
First, you cannot be agile if you have friction along the path. Being agile is all about reducing the friction from the inital requirement, to the updated requirements to the end result, all in an iterative feedback loop. If you don't look at the whole picture, the end-to-end experience breaks down.
To ensure a delightful experience for the user, we need to react fast. I'll focus on four points in this presentation with how we addressed each of these friction points.
Let's start with friction for developers. That what most people thinks about for agile development, staying up to date with the requirement from your client and updating accordingly. But in practice that's not sufficient. There's mentality and practice changes to ensure that the lean small team agility scales to hundreds of developers.
Feature branches are the worst of all evils. Let me explain. Most of the time, the feature doesn't work well on the branch, but everytime, it doesn't work at all when merged back on the master branch. Even more, they are disruptive to merge and most of all; they always need to be merged yesterday.
The incentives are just plain wrong. First, it's too easy to underestimate the cost of the feature branch for a product manager. The developer wants a stable code base to work on. But staying on old code means that he doesn't see any refactoring going on, so when he merges back, he may need to redo his change completely on the new code. What this means is that after a while, nobody evey dare to refactor the code in fear of retaliation.
Most of our features are developed directly on the master branch and disabled just after forking a release branch if not stable worthy. Let me do a pause here, we have 4 release channels in increasing order of freshness, stable, beta, developer and canary. The canary strives to be a daily build. That's why the canary is very unstable as all the experimental features are enabled, once a developer release branch is made, all the experimental features that have no chance to go to stable in their current state are disabled on the branch. They'll have their chance in 6 weeks as we do new major release every 6 weeks. Devs like to work on a branch that works for them so they can do a fast code/build/debug cycle. If unrelated breaking change breaks their local checkout, they will be staled. Scaling up the number of developers means scaling up the number of commits per hour.
First line of defense is code reviews. We require it for almost all of the commits going in. There are exceptions but in general, not doing code reviews is frowned upon. Code reviews are a great tool to improve the code quality before it gets checked in, and has the side effect of helping new contributors to ramp up much faster by getting immediate feedback for their first change. Having effective code reviews require the developers to care about giving thoughtful feedback. That's a culture that is hard to enforce after starting a project.
Question: Who does systematic code reviews on their projects?
Question: Who uses continuous integration in their projects?
To work on master branch, the code must work. It's all about making sure each and every single commit is valid enough so that all the main functionality of the project is kept. Let me precise a little bit here. We don't want to product to be completely perfect; good enough will suffice. Good enough that we can actually release a build directly from it. That's a huge mentality ship for certain developers but that's a requirement. This also requires a huge investment in the infrastructure. In practice, we use buildbot, which is a open source python framework to manage a centralized build system. It's also used by WebKit and Mozilla and others. In practice, the tool doesn't matter, the process matters. We try to keep the testing cycle as low as possible to reduce the latency so the feedback loop from commit to test results is as tight as possible. To help with reducing tree breakage in the first place, we also use pre commit tests.
A colleague wrote recently; we often discuss our software engineering practices, especially code review and testing, but I think that sometimes the reasons for our practices get lost in the mix. We aim:
This is in support of our goal to release as often as possible, which makes us more agile. We absorb the extra costs of thorough code reviews and testing since we believe that these things help keep us agile and reduce costs in the long run. Code reviews and automated fuzzing also reduce security risks.
These goals and practices are really suited to the age of the Internet, where software deployment is fast and easy. Compare that to a few years ago where people where still getting their software on DVDs!
Question: Who installed software on DVD/CD this year?
It's especially true for server-side only projects but it is now also true for projects with fast auto-updates à la Google Chrome.
This busy graphic represents our release process. Each line is a branch in our source tree. Horizontal lines are what is released to our users. An aborted branch is one that didn't pass QA. The red section is the developer channel, the yellow section is the beta channel and the green section is the stable release.
We also have what we call the canary channel but I'm not depicting it here because it's a release done directly from trunk. This is to prove that trunk is always stable enough to make a release out of it. The canary channel strives to be daily or at least every few days. The developer channel is roughly weekly. We then release one major beta channel and stable channel release every 6 weeks. A dot release can be done on the beta channel or stable channel to introduce more fixes, which is depicted here as branch inside the same channel.
Indeed, we have very good people that manages these branches. The trick to make it workable is to send our dev channel release in the field, look for bugs found out and merge fixes back in the beta or stable channel as needed. Also, and let me put emphasis here, every code change is done on the trunk first and then merged back to a branch. No change is ever directly done on a branch.
We try to have our branches as short lived as possible. With a 6 weeks of dev, beta and stable channels, a branch lives roughly for 18 weeks, or 3 months. Which means that we don't care about any code that is more than 3 months old, which helps us tremendously to stay flexible on the code base refactoring. That's really a huge advantage. The pipelining here is really what makes us much more efficient because we get different kind of feedback from each of the channels. Let's talk about user feedback.
Users don't like questions and second, they don't like to be interrupted. Many of them don't want to do the the cognitive effort of task switching when it's not their will. Every single dialog box is a context switch. So we strive to only ask meaningful questions. For example, that's the main reason we are not installing in the Program Files directory, it's just too much friction for the user. Delight the user and make the application as easy to use as a web page.
FYI, UAC prompt is the dialog box with all the remaining of the screen being grayed out, specific to Windows.
Thanks for wasting my precious time, folks. So please don't make your users angry at you. Nobody likes a dialog box poping during a web based presentation.
The reason I'm raising this point is that it's important to reduce the number of different state each client can be. As you increase the number of one-off patches, having users to upgrade multiple modules independently, you increase the odds of something going bad. So it's easier, at least for us, to make sure everyone gets the same bits and disable functionality as needed.
Question: As an example, who used Google Talk video chat or Google+ Hangout? Who recalls installing the plugin to make that work? Who recalls updating the plugin? Well, it probably updated since you installed it, you just never realized and you probably don't care.
By releasing more frequently, the number of changes per upgrade is reduced. This means that users will see smaller changes more frequently, instead of big changes every year or so. This makes it easier for users on the long run, as you don't need to do big training for non-tech-savvy users. Also, users become used to see gradual changes occuring, which reduces the natural change aversion some users may tend to have.
Also, decoupling the automatic update software from the client software helps recovering in case of failure. I recommend to keep them as separate as possible from a software version standpoint. Especially when the failure comes from third party software; In our case, we use Omaha, which we open sourced. I'll leave a reference at the end of the presentation.
One interesting thing once a project grows large enough is to be able to do controlled A/B testing with a small percentage of your users. To be able to do this, all the infrastructure must be set to be able to disable the experiment really fast if something goes bad. Also, the intrinsic goal of the testing is to gather feedback so we must make sure we are getting it. On a web service, it's relatively easy to manage. It's much more fun to do on client applications. We made privacy aware anonymous feedback infrastructure to gather experiment information only from users who opted to send statistics.
This gives valuable information for things where the test matrix is too large. Here are a few examples: test OpenGL implementation robustness, test effectiveness of different on disk web cache heuristics, implement new SSL designs and verify its compatibility in the field, tweak omnibox or autofill heuristics, network packet retries heuristics, etc. Lots of things where it's really about tweaking the best values or the best automatic selection of values. The important thing here is to correctly do the measurement and do it both anonymously and in an non-intrusive way.
In that case, I'm specifically referring to third party software. There's the third party software extending the browser, like plugins and extensions, but the actual web pages are also to be considered third party software, since it's code interpreted inside the browser. Each new version of the browser may affect the way the plugins, extensions or web pages are handled.
Actually, we solve the problem since we make the update ping infrastructure Google Chrome uses for itself available for extensions too. This is a huge benefit for extension developers because it's like having all your users running tip of tree all the time, also on the latest version of the browser, with only a worst-case propagation delay of 5 hours. All this is free for extension writers. This is a huge difference from the experience on other web browsers. And extensions are updated as you are using the web browser, without ever requiring the user to reopen his windows to be updated. This is a huge benefit for the users since I found out upgrading extensions can be very annoying in some case for some applications.
Unless your process is running on dedicated hardware like a gaming console, you need to try to show respect to the other processes so the OS doesn't misbehave. Here's an example of things we do. Bombing the system with a number of utility processes. Using too much physical memory. Using too much bandwidth for automatic upgrades.
One workaround is to use incremental upgrades so we only download the delta. Another is to limit the number of processes and further more by the amount of physical memory. The processes use shared memory as much as they can and trim the in-cache memory aggressively. We slow down background tabs to reduce CPU usage, to increase battery life. We slow down the various timers while on battery, also to increase battery life. But still, what if you are breaking hooks from an anti-virus software or even worst, from malware hooking on kernel32, which brings me to...
One recent issue was what do you do if some security software thinks you are a virus and tries to delete your application? This may mean pushing new code that will work around the problem. That's what we did, within half a day of learning about it so many users may even realized. To achieve this, your release team must always be ready to fire and manual QA must be minimal. I'm not trying to point fingers here, that's the kind of thing that cannot be tested by all security vendors. We just have to live with that possibility.
But still, what if some malware hooks your process, and crashes because it fails to access the sandboxed process? Which brings me to...
Security can be viewed as making sure that no action is done on a user's behalf by a third party that could go against his will. If you have valuable information, people will try to hack you in all possible ways. Spoofing our DNS name, proxies serving fake SSL certificates and on the fly packet translation. We've seen it all. Many of these are hard to address from the web service directly. Having a end-to-end solution helps there, even if it is for a subset of users. You can't just think about security without looking at the overall picture. One important thing we developed is the Google Chrome sandbox. Feel free to meet me after the presentation if you'd like to talk about it.
Always look for the easiest path, that's what crooks do. They are willing to invest a significant amount of time in searching to figure out holes in the security mechanisms. And once they find something, they will use it fast to affect as many people as they can.
How do we address that?
That's simple. Put your bits out as fast as you can. Seriously, our security guys were argumenting about the upgrade ping rate. Is it better to ping at every 3hrs or 5hrs. Doing the check on startup is the worst time you can do that. The system is already CPU and I/O bound. But updating still means for most users to notify the user only if a real immediate risk is present. So the basic idea is to update the system but make it almost transparent and hopefully magical to the user. Subtle enough that most users won't realize but power users will act upon. In both Google Chrome and Chrome OS, we decided to add a little green arrow when an update is ready to be installed.
Enterprises like to control deployment of the software used by their workforce. That's normal, a disruption in their workforce could cost a lot of money to the company. How to conciliate that with forced automatic upgrades? First of all, because of our release pipeline, they can canary themselves a few users on the beta channel or the dev channel, which gives them an astronomical 6 to 12 weeks to find out quirks and if it's not enough, they can delay stable update a bit more if really needed.
We created tools for administrators to delay updates or apply policies to installations of Google Chrome or Chrome OS on their domain. What helps us, and them, here is that each upgrade only represents 6 weeks of changes that occured less than 3 months ago, so the freshness is always relatively good. Second, they have one package to control, not sixty billion small component upgrades to mess with. This reduces a lot the overhead of patches management. In the end, everyone wins with a more secure and simple environment.
Note that I forgot to talk about Build Sheriffs, which are an important part of the cultural team mindset that helps making sure the team is make progress in a cohesive way.
Obi-Wan Kenobi is copyright Lucasfilm