Unhealthy Test Insights

Tests are hard to maintain. Once you write them, they have a tendency to stick around, even when it’s no longer clear what value they provide or when they are hurting more than helping.

Subject matter experts (SMEs) who maintain those tests often struggle to make a convincing case for the work needed to improve the tests' effectiveness, and they get frustrated.

The overall quality of tests suffers, and in the worst case, the tests become so annoying that developers lose trust in them.

Hence the Unhealthy Tests page in Launchable! This page surfaces tests that exhibit specific issues so you can investigate and make necessary changes.

Unhealthy Test stats are aggregated at the 'altitude' that your test runner uses to run tests. See #Subset altitude and test times for more info on this concept.

Flaky Tests

About flaky tests

Flaky tests are automated tests that fail randomly during a run for reasons not related to the code changes being tested. They are often caused by timing issues, concurrency problems, or the presence of other workloads in the system.

Flaky tests are a common problem for many development teams, especially as test suites grow. They are more common at higher levels of the Test Pyramid, especially in UI and system tests.

Like the fictional boy who cried “wolf,” tests that send a false signal too often are sometimes ignored. Or worse, people spend real time and effort trying to diagnose a failure, only to discover that it has nothing to do with their code changes. When flakiness occurs with many tests, it can make people wary of all tests and all failures, not just flaky tests, causing a loss of trust in tests.

Tests that produce flaky results should be repaired or removed from the test suite.

Flaky Test Insights

To help with this, Launchable can analyze your test runs to identify flaky tests in your suite.

All you have to do is start sending data to Launchable. After that, the Flaky tests page should be populated within a few days.

However, for flakiness scores to populate, you need to run the same test multiple times against the same Build. In other words, you need to have a retry mechanism in place to re-run tests when they fail. (This is usually already the case for test suites with flaky tests.)
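If you don't already have one, here is a minimal sketch of such a retry loop in Python, assuming your runner lets you re-run an individual test. The run_test() function is a hypothetical stand-in that fails randomly to simulate flakiness; in practice, you would replace it with your real test invocation and report every attempt's result to Launchable.

```python
import random

# Hypothetical sketch of a retry mechanism. run_test() is a stand-in that
# fails randomly to simulate flakiness; swap in your real runner call.
def run_test(test_id: str) -> bool:
    return random.random() > 0.3  # simulated: passes ~70% of the time

def run_with_retries(test_ids, max_retries=2):
    results = {}
    for test_id in test_ids:
        attempts = []
        for _ in range(1 + max_retries):
            passed = run_test(test_id)
            attempts.append(passed)
            if passed:
                break  # stop retrying once the test passes
        results[test_id] = attempts  # e.g. [False, True] looks flaky
    return results

print(run_with_retries(["example.myTest1", "example.myTest2"]))
```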

Launchable re-analyzes your test sessions to extract flakiness data every day.


Flakiness score

A test is considered flaky if you run it multiple times against the same build and it sometimes passes and sometimes fails.

The flakiness score for a test represents the probability that, when the test fails, it will eventually pass if you run it repeatedly.

For example, let's say you have a test called myTest1 with a flakiness score of 0.1. This means that if this test failed against ten different commits, in one of those ten cases the failure was not a true failure: if you run the test repeatedly, it eventually passes. This test is slightly flaky.

Similarly, suppose another test called myTest2 has a flakiness score of 0.9. If this test failed against ten different commits, in nine of those ten cases you saw a false failure, and a retry would yield a passing result. That test is very flaky and should be fixed.
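To make the arithmetic concrete, here is a hedged sketch that estimates a score like this from recorded attempts. It assumes the score is simply the fraction of failing sessions that eventually passed on retry; this illustrates the idea but is not Launchable's exact formula.

```python
# Illustrative only: estimate a flakiness-like score as the fraction of
# failing sessions that eventually passed on retry against the same build.
# Not Launchable's exact calculation.
def flakiness_estimate(attempt_history):
    """attempt_history: one list of pass/fail attempts per session,
    e.g. [[False, True], [True], [False, False]]"""
    failing = [a for a in attempt_history if not a[0]]
    false_failures = [a for a in failing if a[-1]]  # failed, then passed
    return len(false_failures) / len(failing) if failing else 0.0

# myTest1-like behavior: of ten failing sessions, one passed on retry
history = [[False, False]] * 9 + [[False, True]]
print(flakiness_estimate(history))  # 0.1
```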

Total duration

The dashboard also includes the total duration of each flaky test. Since flaky tests are often retried multiple times, these retries add significant extra time to each test run.

The total duration is useful for prioritizing which flaky tests to fix first.

For example, you might have a very flaky test (i.e., one with a high flakiness score) that either doesn't take very long to run, doesn't run very often, or both. In comparison, a less flaky test that takes a very long time to run is probably the one you'll want to fix first.

Note that the table is sorted by flakiness score in descending order, not by total duration.
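If you want to triage on both signals at once, here is an illustrative sketch that ranks tests by flakiness score weighted by total duration. The heuristic and the sample numbers are assumptions for this example, not the ranking the dashboard uses.

```python
# Illustrative triage heuristic: weight flakiness score by total duration.
# Sample data; the dashboard itself sorts by flakiness score alone.
tests = [
    # (test name, flakiness score, total duration in minutes)
    ("myTest1", 0.9, 2.0),
    ("myTest2", 0.4, 45.0),
]

ranked = sorted(tests, key=lambda t: t[1] * t[2], reverse=True)
for name, score, minutes in ranked:
    print(f"{name}: weighted priority = {score * minutes:.1f}")
```

With these sample numbers, myTest2 ranks first despite its lower flakiness score, matching the intuition above.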

Never Failing Tests

Tests that never fail are like cats that never catch any mice. They take up execution time and require maintenance, yet they may not add value. For each test, ask yourself whether it provides enough value to justify its execution time. Consider moving the test further right in your pipeline so that it runs less frequently.

A test must run at least five (5) times in order to be considered.
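As a concrete illustration, here is a small sketch that applies the five-run threshold to hypothetical per-test stats and flags tests that have never failed.

```python
# Hypothetical sample data: per-test run and failure counts.
runs = {
    "checkout.testHappyPath": {"runs": 42, "failures": 0},
    "checkout.testTimeout":   {"runs": 3,  "failures": 0},  # too few runs
    "search.testRanking":     {"runs": 50, "failures": 4},
}

# Apply the rule above: at least five runs, and zero failures.
never_failing = [
    name for name, stats in runs.items()
    if stats["runs"] >= 5 and stats["failures"] == 0
]
print(never_failing)  # ['checkout.testHappyPath']
```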

Longest Tests

Slow tests are like gunk that builds up in your engine. Over time they slow down your CI cycle.

Most Failed Tests

Tests that fail too often are suspicious. Perhaps they are flaky. Perhaps they are fragile/high maintenance. Perhaps they are testing too many things in one shot.