Canary Threshold Calibration

A canary score is calculated by various algorithms, and user-defined thresholds for “good” and “bad” are used to automate the decision about overall health.

This section describes a method for calibrating the score against an application in order to determine its thresholds. It focuses on a simple score from 0 to 100, where a higher number indicates more behavioral similarity between the baseline version and the canary version.

How Thresholds Work

Autopilot has two user-defined threshold values: one for the failure level and one for the pass level. Each score is compared to these values. If the score is below the failure level, the canary is marked as a failure; if it is above the pass level, the canary is marked as a pass. A score between the two levels falls in a region where no automated decision is available, and typically requires a human to decide on the canary's health.
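The comparison described above can be sketched in a few lines of Python. This is an illustration only, not Autopilot's implementation: the function name `classify` and its return labels are hypothetical, and treating a score exactly equal to the pass level as a pass is an assumption.

```python
def classify(score: float, fail_threshold: float, pass_threshold: float) -> str:
    """Return 'fail', 'pass', or 'manual' for a canary score from 0 to 100."""
    if fail_threshold > pass_threshold:
        raise ValueError("fail_threshold must not exceed pass_threshold")
    if score < fail_threshold:
        return "fail"
    # Assumption: a score exactly at the pass level counts as a pass.
    if score >= pass_threshold:
        return "pass"
    # Anything in between needs a human decision.
    return "manual"


if __name__ == "__main__":
    for s in (30, 51, 85):
        print(s, classify(s, fail_threshold=40, pass_threshold=70))
```

With thresholds of 40 and 70, a score of 30 is an automated failure, 85 is an automated pass, and 51 is left for manual review.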

The goal is to find values for these thresholds such that most decisions are automated, and the fewest possible false decisions are made by the system.

Every application may have some variance in its metrics and logs between instances, even if fed identical traffic and performing identical work. The more natural variance, the wider the range between pass and fail needs to be to account for uncertainty.

Thresholds are application specific. The thresholds from one application may be useful as a starting point on a similar application, but in general they are not going to be the same across every application. One good heuristic to determine thresholds is to perform A/A testing.

A/A Test for Determining Thresholds

A good way to measure this internal variance is to perform an A/A test. This test runs a canary but uses the same version for both baseline and canary. By running one or more of these tests, a range of scores can be found for an application. If four tests were run, the scores were 45, 52, 54, and 60, and manual inspection confirms that all of these runs are good, it is reasonable to set the pass value at 45 and the fail value slightly below it (say, 40).

If it is not easy to run multiple tests, a single value (say, 54) can be used as the pass threshold, and a lower value as the fail threshold.
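As a rough sketch of both cases, the initial thresholds can be derived from a list of A/A scores that have been manually confirmed as good. The function name `initial_thresholds` and the `margin` parameter are hypothetical; the margin of 5 points simply mirrors the "slightly below" wording in the example and is not a recommended value.

```python
def initial_thresholds(aa_scores, margin=5.0):
    """Derive starting (fail, pass) thresholds from known-good A/A scores.

    The pass threshold is the lowest good score observed; the fail
    threshold sits `margin` points below it, clamped to the 0-100 scale.
    """
    if not aa_scores:
        raise ValueError("at least one A/A score is required")
    pass_threshold = float(min(aa_scores))
    fail_threshold = max(0.0, pass_threshold - margin)
    return fail_threshold, pass_threshold


if __name__ == "__main__":
    print(initial_thresholds([45, 52, 54, 60]))  # several A/A runs
    print(initial_thresholds([54]))              # a single A/A run
```

For the scores 45, 52, 54, and 60 this yields a fail threshold of 40 and a pass threshold of 45. For a single score of 54 it yields 49 and 54.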

In both cases, these are initial values, and as more canaries are run with different versions, the thresholds can be adjusted.

Adjusting Thresholds

As more canary runs are performed, new scores can be used to tighten the uncertain region. Currently, this is a manual process.

Suppose the fail and pass thresholds are 40 and 70, and a canary scores 51. Because 51 lies between the thresholds, the run needs a manual decision. If manual review determines that the run was a failure, the fail threshold could be moved up to somewhere between 40 and 51, so that automated decisions occur more frequently.
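Since adjustment is manual, the following is only a sketch of how one might compute a candidate adjustment after a manual review. The function name `adjust_thresholds` and the `fraction` parameter (how far to move toward the reviewed score) are hypothetical. Lowering the pass threshold after a manually confirmed pass is an assumption made by symmetry and is not described in the text.

```python
def adjust_thresholds(fail_threshold, pass_threshold, score, verdict, fraction=0.5):
    """Narrow the manual region using one manually reviewed canary score.

    `verdict` is the human decision, 'fail' or 'pass'. The matching
    threshold moves `fraction` of the way toward the reviewed score.
    """
    if verdict not in ("fail", "pass"):
        raise ValueError("verdict must be 'fail' or 'pass'")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Only scores inside the manual region carry new information.
    if not fail_threshold <= score < pass_threshold:
        return fail_threshold, pass_threshold
    if verdict == "fail":
        fail_threshold += fraction * (score - fail_threshold)
    else:
        # Assumption: a confirmed pass lowers the pass threshold.
        pass_threshold -= fraction * (pass_threshold - score)
    return fail_threshold, pass_threshold


if __name__ == "__main__":
    print(adjust_thresholds(40, 70, score=51, verdict="fail"))
```

With thresholds of 40 and 70, a score of 51 judged a failure, and the default `fraction` of 0.5, the fail threshold moves to 45.5, which lies between 40 and 51.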

The threshold levels are a trade-off between automated decisions and false decisions. The narrower the middle region where a manual judgement must be made, the more likely it is that an automated decision will be wrong. The wider this region, the more decisions will have to be made manually.
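One way to see this trade-off is to replay a set of already-reviewed canary runs against candidate thresholds and count how many decisions would be automated and how many of those would be wrong. The sketch below assumes such a reviewed history exists. The function name `evaluate` and the sample `history` are hypothetical, and the scores are made up purely for illustration.

```python
def evaluate(history, fail_threshold, pass_threshold):
    """Return (automation_rate, false_decisions) for candidate thresholds.

    `history` is a list of (score, healthy) pairs, where `healthy` is the
    manually confirmed outcome of that canary run.
    """
    if not history:
        raise ValueError("history must not be empty")
    automated = 0
    false_decisions = 0
    for score, healthy in history:
        if score < fail_threshold:
            automated += 1
            if healthy:          # healthy canary wrongly failed
                false_decisions += 1
        elif score >= pass_threshold:
            automated += 1
            if not healthy:      # unhealthy canary wrongly passed
                false_decisions += 1
        # Scores in between are left for manual review.
    return automated / len(history), false_decisions


if __name__ == "__main__":
    # Illustrative, made-up history: (score, manually confirmed healthy?)
    history = [(30, False), (51, False), (45, True), (52, True),
               (54, True), (60, True), (85, True)]
    print(evaluate(history, 40, 70))  # wide manual region
    print(evaluate(history, 50, 52))  # narrow manual region
```

On this sample, thresholds of 40 and 70 automate two of the seven decisions with no false decisions. Thresholds of 50 and 52 automate six of the seven, but wrongly fail the healthy run that scored 45.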
