Time complexity tests verify the order of growth of time complexity T(n) for various operations, generally verifying that this is O(1) or O(n), rather than O(n2), say.
These are a form of performance test (see Adding Performance Tests), but since they have a binary answer (satisfies the bound or doesn't), and we are concerned with the overall growth, not the absolute speed, they are instead technically in the Blink Layout Tests, and produce PASS/FAIL.
These are potentially flaky, and layout test failures break the CQ, so please take extra care in testing and committing these.
Tests consist of:
You can have several tests in the same file (they should be related); simply have several
You can set some test parameters, specifying what range of magnitudes to test and how strict the tests should be. This is optional, but generally necessary. This is because the tests default to quite strict tolerances, and start testing from n = 1, which is usually too low, since constant terms often dominate for low n.
The design reason for requiring parameters is to make the data of "over what range and to what tolerance the growth rate holds" an intrinsic part of the test: by specifying these parameters, someone reading the test file can see at a glance how big input needs to be to be "close enough to infinity", and how noisy the data is.
You can also specify more trials, or a different cutoff for overall success, but this is generally not necessary.
To set the parameters:
When committing, safest is to:
It is ok, and often a good idea, to add a bit of a margin of error (or extra test runs) if you're concerned about false alarms, but be careful to not make the test so lax that it always passes. The test should fail if something changes for the worse; that's what it's there for.
You will primarily vary the parameters
The parameters are as follows:
Range of magnitudes
You will usually need to modify these, as for low magnitudes low order terms (especially constants) often dominate, obscuring the high order terms (linear, quadratic, exponential) that we're interested in, but we don't want to test for 220 to 230, say, both because this is slow, and has noise from memory management once you get this big. Usually around 28 to 215 is a good range, but it varies.
Usually leaving these at defaults (3 trials, 50% success threshold) is fine, and instead modify statistical parameters (trim and tolerance).
You generally want an odd number of trials, so you can use majority rule:
"Never go to sea with two chronometers; take one or three."
(quoted in Fredrick Brooks, The Mythical Man-Month, p. 64),
and 3 trials is usually enough.
50% (majority rule) optimizes the overall trade-off between sensitivity and specificity (true positive rate and true negative rate). If you have the same rate for both, the complimentary rates (false negative rate and false positive rate) are attenuated linearly (as powers, i.e., on the log scale) in number of trials needed for success/failure. By "linear attenuation" we mean like "decreasing decibels of noise": if false negative rate is α and false positive rate is β, then requiring j successes for overall success or k = (n − j) + 1 failures for overall failure makes the false negative rate approximately (ignoring constant factor) αj and false positive rate approximately βk, which in log terms is j log α and k log β. For n = 3 and requiring 2 successes for overall success (hence 2 failures for overall failure), this doubles the attenuation of the false negative and false positive rate, which is usually enough.
In rare circumstances you may want to change these:
At each magnitude, the test is run until this time is used up, then averaged. You can increase this if you have a slow test, or at running at high magnitudes where you're not getting enough iterations to get a good average. Consider if you can instead decrease the maximum magnitude.
Given the times for all magnitudes, the test statistic (see below) is computed by first trimming the most extreme values (basically outliers in the times, actually derived values like ratio between successive times), then computed and checked against the tolerance. These default to quite strict values – no trim and 5% tolerance – which are quite safe, but usually too strict. You will thus need to relax them and specify how far from model the actual data is. Common reasonable values include:
Beware of trimming too much, as you're effectively reducing
Larger tolerances are ok when testing for O(1) – even values of 100% or more are ok if you're only interested in checking that it's not linear or worse (rather than not log) – but if you start getting to 50% (at which point you're not rejecting O(log n)), you have very noisy data.
For linear and polynomial tests – O(n) and O(n2) – the key numbers to look at are the log-ratios (base 2), which are effectively the exponent of the order of growth at that step. Thus for linear growth this should be almost 1, while for quadratic or greater growth it should be 2 or more.
For example, given the following data:
...so we see that at first it grows slowly (.45), then super-linearly (1.46), then superquadratically (2.43, 2.50, 1.95 is close), then slows down but still superlinear (1.28), and overall it deviates substantially from linear (2.50 differs from ideal of 1.00 by 1.50, which is the Maximum Absolute Deviation). This is an extreme case (normally there'll be a trend), but shows that this growth isn't linear over this range of inputs.
The code at
You can get very detailed diagnostics and intermediate calculations by running in the TestRunner (?), or setting an invalid expectation (e.g., expecting CONSTANT for data you expect to be LINEAR).
In outline, testing proceeds as follows.
We time the run time of the tested function at exponentially spaced magnitudes, e.g., 20, 21, 22, 23, ....
We then verify if these times look like the time complexity we're expecting (constant, linear, or polynomial (quadratic or greater)).
Tests are robust, non-parametric statistical tests, since timing is noisy (so need to be robust), and noise can take various forms (so non-parametric, since no particular model of noise).
In more detail:
Tests are all of the form: "use trimmed maximum absolute deviation (from model) as test statistic", meaning "compare this value against tolerance, and succeed if error doesn't exceed tolerance". In detail:
For constant, this compares each value against the median.
(If deviation from median is only 20% or 30% as you vary n from 1 to 1,024, any linear or higher term is very small.)
For linear and polynomial (quadratic or more), at each step, check what doubling the input does to the time – for linear it should double it, while for polynomial it should quadruple it or more. Taking log (base 2), this says the log of ratios of successive times (equivalently, difference of successive logs) should be 1 (linear) or more than 2 (quadratic or more).
This is effectively looking at slope of successive steps on the log-log scale, as the slope of a monomial on the log-log scale is its exponent.
These are simple tests, not sophisticated ones (we're not doing robust non-parametric linear regressions on the log-log scale, say), but they are thus transparent, and should be sufficient. More sophisticated tests that are still easily implemented include doing Ordinary Least Squares (OLS) on the log-log scale, and doing this iteratively (discarding point with greatest residual and repeating).
When writing tests, watch out for: