We have a lot of layout test failures. For each failure, we have no good way of tracking whether anyone has looked at the test output recently, or whether the output is still broken or should be rebaselined. We just went through a week of rebaselining, and we stand a good chance of needing to do it again in a few months, losing all of the knowledge that was captured last week.
So, I propose a way to capture the current "broken" output from failing tests and to version-control it, so that we can tell when a test's output changes from one expected failing result to another. Such a change may mean that there has been a regression, or that the bug has been fixed and the test should be rebaselined.
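As a rough sketch of the mechanism (classify_failure and the expected-failures directory are hypothetical names, not an existing run_webkit_tests API):

```python
import hashlib
import pathlib

def classify_failure(test_name, actual_output, failure_dir="expected-failures"):
    """Compare a failing test's output against a version-controlled snapshot
    of its last known ("expected") failing output.

    Returns one of:
      "same-failure" - still broken in exactly the way we already knew about
      "new-failure"  - the diff changed: a regression, or a fix that means
                       the test should be rebaselined
      "untracked"    - no snapshot has been captured for this test yet
    """
    snapshot = pathlib.Path(failure_dir) / (test_name + ".checksum")
    actual_checksum = hashlib.sha1(actual_output.encode("utf-8")).hexdigest()
    if not snapshot.exists():
        return "untracked"
    if snapshot.read_text().strip() == actual_checksum:
        return "same-failure"
    return "new-failure"
```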
Note, however, that the overhead of this should trend to zero as we get closer to zero test failures. Realistically, we should expect to have expected failures for quite some time, so anything we can do to work with them more effectively is probably a good thing.
Also, I am working on a separate change to store the expected images on the server, so that they don't have to be pulled down locally into the tree.
First, let's think about how one writes tests. Typically, there are two approaches.
The first, and most popular these days, is to write a self-contained test that checks its own output and simply announces "pass" or "fail". This is in fact the recommended way to write tests in WebKit, and is how xUnit-style tests are usually written.
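For instance, a self-checking test written with Python's unittest (standing in here for whatever harness a project actually uses) looks like:

```python
import unittest

class ListInsertTest(unittest.TestCase):
    # A self-contained test: it computes the result and verifies it itself,
    # so the framework only has to report pass or fail.
    def test_insert_keeps_order(self):
        items = [1, 3]
        items.insert(1, 2)
        self.assertEqual(items, [1, 2, 3])

if __name__ == "__main__":
    unittest.main()
```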
The second is to separate the test from its output, and to use a driver that checks the output against an expected result (or baseline) to determine whether the test passed or failed. This is how run_webkit_tests works.
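A toy driver along those lines might look like this (a sketch; only the "-expected.txt" naming convention is borrowed from run_webkit_tests):

```python
import pathlib
import subprocess

def run_test(test_path):
    """Run one test and compare its output to a checked-in baseline.

    The driver, not the test itself, decides pass/fail: the test just
    produces output, and the baseline file records what that output
    should be.
    """
    test = pathlib.Path(test_path)
    actual = subprocess.run(
        ["python", str(test)], capture_output=True, text=True
    ).stdout
    expected = test.with_name(test.stem + "-expected.txt").read_text()
    return "PASS" if actual == expected else "FAIL"
```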
Most people prefer the first approach because there are fewer files to maintain, and the purpose and correctness of the test are more obvious. However, in some cases (e.g., pixel tests in the renderer), the first approach simply isn't possible (or, at least, practical).
Both approaches, however, have drawbacks, both in the normal case and in the "we expect this test to fail" case.
The second approach is, in effect, the first approach with the expected output factored out into separate files. It is perhaps preferable where multiple correct results are the norm rather than the exception, which is why we use it to compare PNGs across multiple platforms.
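A simplified sketch of that kind of platform fallback (the directory names and find_baseline are illustrative, not the actual run_webkit_tests search path):

```python
import pathlib

def find_baseline(test_stem, search_path=("platform/mac", "platform/win", ".")):
    """Look for a test's expected PNG along a platform fallback path:
    the platform-specific directories first, then the generic one. Each
    hit is equally "correct" -- for pixel output, multiple right answers
    are the norm rather than the exception.
    """
    for directory in search_path:
        candidate = pathlib.Path(directory) / (test_stem + "-expected.png")
        if candidate.exists():
            return candidate
    return None  # no baseline checked in anywhere
```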
Now, in the case where tests fail, one can actually view the failure as a different kind of "correct": we know the output is wrong, but it's an "expected" diff, and in most cases we want to know if the diff changes from what we expect. Perhaps we actually fixed the bug? Perhaps we introduced a new bug? In fact, one could argue that platform-specific baselines are themselves "expected wrong" baselines.
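Seen that way, a results checker needs three outcomes instead of two; a sketch, again with hypothetical names:

```python
def check_result(actual_hash, passing_hash, expected_failure_hash=None):
    """Classify a test run once failures can themselves be "expected".

    passing_hash          - checksum of the correct (baselined) output
    expected_failure_hash - checksum of the known-wrong output, if any
    """
    if actual_hash == passing_hash:
        # Matches the right answer; if a failure was expected, the bug
        # may have been fixed and the expectation should be removed.
        return "PASS (fixed?)" if expected_failure_hash else "PASS"
    if actual_hash == expected_failure_hash:
        return "EXPECTED FAILURE"  # still wrong, but wrong as expected
    return "UNEXPECTED FAILURE"    # the diff changed: new bug, or a fix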
Tracking "expected diffs" introduces its own woes - what if the diff output is not deterministic? Or, and more importantly, how do you distinguish "expected wrong diff" from "expected right diff"?
Lastly, one could argue that we should spend more time fixing the bugs that cause the diffs, and less time tracking diffs :) Unfortunately, it's a lot faster to baseline expected diffs than it is to fix them :(