What the Data Tells Us: When it Rains, it Pours
Test Data Findings and Analysis by Kohsuke Kawaguchi
Key Takeaways
A test that failed recently is very likely to fail again soon.
You can learn a lot from looking at data you already have, and it has a practical impact on your developer productivity.
This article was originally published on Kohsuke's LinkedIn
As late as the 19th century, bloodletting was a common practice. It was “obvious” that illness could be prevented or cured by withdrawing blood from patients. Today we know that this is the price we pay when we let belief and common sense make decisions, instead of learning from data.
At Launchable, learning from data is what we do day in, day out. It’s really interesting to put our hypotheses and our common sense to the test, and to apply the lessons to produce practical value. So today, I want to share some of those findings with you.
Misery loves company
When we looked at the data we collected from customers, one thing became clear: a test that failed recently is very likely to fail again soon.
It makes some sense, right? I’m sure every developer has had the experience where a first attempt at a change breaks some tests, a tweak doesn’t quite fix all of those failures, and it takes a few repeats of this process to get to a clean green build. We’d expect that kind of short burst of failures, say over a span of up to a few days.
But what might surprise you is just how well this trend extends over a much longer time horizon.
Take a look at this graph. It answers the question: “when we see a test failure, what are the chances that the same test had failed within the last N days?” The three lines are for three different customers we’ve done this analysis for. Each line covers all the test executions for a given project, across multiple engineers.
You can see the blue line starts at about 65% at 1 day. That means that when a test fails for this customer, 65% of the time that same test had also failed within the last 24 hours. That’s surprisingly high, isn’t it?
But it doesn’t stop there. As you can see, the effect continues over a much, much longer time horizon. Notice that the X-axis is on a log scale, yet the lines look basically straight. If failures were happening randomly, which is what you’d expect after a while, the graph would not look like this.
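To make the question concrete, here is a minimal sketch of how you could compute this statistic from test results you already have. The record format, field names, and function name are made up for illustration; this is not Launchable’s actual pipeline, just one way to reproduce a point on a line like the ones in the graph.

```python
from datetime import timedelta

def recent_failure_rate(records, window_days):
    """records: iterable of (test_id, timestamp, passed) tuples.
    Returns the fraction of failures that were preceded by a failure
    of the same test within `window_days` days."""
    window = timedelta(days=window_days)
    last_failure = {}            # test_id -> timestamp of most recent failure
    failures = 0
    repeats_within_window = 0
    for test_id, ts, passed in sorted(records, key=lambda r: r[1]):
        if passed:
            continue
        failures += 1
        prev = last_failure.get(test_id)
        if prev is not None and ts - prev <= window:
            repeats_within_window += 1
        last_failure[test_id] = ts
    return repeats_within_window / failures if failures else 0.0

# Sweeping N gives one line of the graph (x: window in days, y: rate):
# curve = [(n, recent_failure_rate(records, n)) for n in (1, 7, 14, 30, 90)]
```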
This data suggests that not only does it make sense to run the most recently failed tests in your PR, it also makes sense to run tests that your colleague broke two weeks ago. A simple way to act on that is sketched below.
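As a sketch of the idea only (not the algorithm Launchable uses, and with hypothetical names), you could keep a map from each test to its most recent failure time and run the most recently failed tests first:

```python
from datetime import timedelta

def prioritize_by_recent_failure(all_tests, last_failure, now):
    """Order tests so that the most recently failed ones run first.
    Tests with no recorded failure run last, in their original order."""
    def key(test_id):
        ts = last_failure.get(test_id)
        # (0, age) sorts failed tests by how recently they failed;
        # (1, 0) pushes never-failed tests to the end (stable sort keeps order).
        return (0, now - ts) if ts is not None else (1, timedelta(0))
    return sorted(all_tests, key=key)
```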
Data reveals value
This just goes to show that when you look at data you already have, you can learn a lot from it, and it has a practical impact on your developer productivity – in this case, figuring out the right smoke tests to run for your change.
That the same trend shows up across different teams means there’s little point in everyone doing this analysis and solving this problem independently. That’s one reason why I founded Launchable: to solve this problem generally, like I did with Jenkins.
And if this one little analysis yields this much value, just imagine what else we can do by putting more effort into the data. That’s exactly what we do here at Launchable!