A test is considered flaky if, when you run it multiple times against the same build, it sometimes passes and sometimes fails.
The flakiness score for a test represents the probability that a reported failure is a false failure -- that is, the test fails but eventually passes if you run it repeatedly.
For example, let's say you have a test called myTest1 with a flakiness score of 0.1. This means that if this test failed against 10 different commits, in 1 of those 10 commits the failure was not a true failure: running the test repeatedly would eventually yield a pass. This test is slightly flaky.
Similarly, another test called myTest2 has a flakiness score of 0.9. If this test failed against 10 different commits, 9 of those 10 failures were false failures that a retry would turn into a pass. That test is very flaky and should be fixed.
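The score described above can be sketched as a simple ratio: of the commits where the test failed, what fraction were false failures that passed on retry? This is an illustrative sketch, not the dashboard's actual implementation; the function name and input format are assumptions.

```python
def flakiness_score(failure_records):
    """Estimate a flakiness score for a test.

    failure_records: one boolean per commit where the test failed;
    True means a retry against the same build eventually passed
    (a false failure), False means it was a true failure.
    Returns the fraction of observed failures that were false failures.
    """
    if not failure_records:
        return 0.0
    false_failures = sum(failure_records)
    return false_failures / len(failure_records)

# myTest1: failed on 10 commits, 1 of those passed on retry -> 0.1 (slightly flaky)
print(flakiness_score([True] + [False] * 9))
# myTest2: failed on 10 commits, 9 of those passed on retry -> 0.9 (very flaky)
print(flakiness_score([True] * 9 + [False]))
```

The two calls reproduce the myTest1 and myTest2 examples above: 1 false failure out of 10 gives 0.1, and 9 out of 10 gives 0.9.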
The dashboard also includes the total duration of each flaky test. Since flaky tests are often retried multiple times, those retries add up to a lot of extra time spent running tests.
The total duration is useful for prioritizing which flaky tests to fix first.
For example, one test might be very flaky (i.e. have a high flakiness score) but run quickly, run infrequently, or both. Another test might be less flaky but take a very long time to run -- so you'll probably want to fix that one first.
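That prioritization can be sketched by sorting test records by total duration instead of flakiness score. The test names, record fields, and numbers below are hypothetical, not taken from the dashboard:

```python
# Hypothetical flaky-test records: flakiness score and the total time
# (in seconds) the test has spent running, including retries.
tests = [
    {"name": "loginSmokeTest",   "flakiness": 0.9, "total_duration_s": 15},
    {"name": "checkoutFlowTest", "flakiness": 0.3, "total_duration_s": 1800},
]

# Sorting by total duration surfaces the test that wastes the most
# machine time on retries, even though it is less flaky.
by_duration = sorted(tests, key=lambda t: t["total_duration_s"], reverse=True)
print([t["name"] for t in by_duration])
```

Here checkoutFlowTest comes first despite its lower flakiness score, because its retries cost far more total run time.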
The table is currently sorted by flakiness score in descending order, not total duration.