Speeding Up Automated Tests, So They Don't Bottleneck the Quality
Launchable Guest Blog Post on 5 Steps to Cut Test Execution Time
Key Takeaways
Automated tests, originally intended to improve quality, could get in the way of quality as the test time grows.
So how did we cut the test execution time to one-third without sacrificing the quality? Here are five steps we took:
Deleted legacy tests
Educated the team about how to deal with flaky tests
Skipped tests under certain circumstances
Fixed slow tests on a file-by-file basis
Optimized common routines that are used by all the tests
Guest post by keik_117, translated by Kohsuke Kawaguchi. Originally posted in Japanese on freee Developers Hub. freee is a Japanese web service company producing online accounting & HR solutions.

Hi, I’m Keik, and I’m an engineer assigned full time to the software quality effort in freee HR.
freee HR uses CircleCI to run automated tests in its application development. Every change goes through a GitHub pull request, and one of the criteria for merging a pull request is passing the automated tests.
This means every change has to wait for the results of the automated tests, no matter how trivial or how urgent it is. At the same time, the codebase is growing by the day, and so is the time it takes to run the automated tests.
Therein lies the growing pain.
Automated tests, originally intended to improve quality, could get in the way of quality as the test time grows.
Specifically, think of the following scenarios:
Developers get discouraged from making small changes because they have to wait for automated tests
Urgent changes get slowed down because they have to wait for automated tests
Just imagine two development environments: one takes 30 minutes to run its automated tests, the other 10 minutes. Say it just occurred to you to add a few more comments to the source code, or to tweak the subject of a message a little. I bet your willingness to put those little thoughts into action would be considerably different between those two environments. Furthermore, the speed at which you can release a change directly influences key DevOps metrics like lead time and mean time to repair (MTTR).
Fortunately, here at freee HR, we are happy with both the time it takes to detect production anomalies and the duration of the deployment process. That means if automated tests take 30 minutes, they become the most pronounced part of the MTTR.
Thus it becomes important to shrink the execution time of automated tests, but herein lies another challenge.
If we reduce the automated test execution time in the wrong way, quality suffers. An obvious example is simply deleting slow tests. This is the story of how we at freee HR took on the effort to reduce the time it takes to run automated tests. The numbers I mentioned above, 30 minutes and 10 minutes, are our real numbers from before and after the effort.
5 Steps to cut test execution time
So how did we cut the test execution time to one-third without sacrificing the quality? Here are five steps we took:
1. Delete legacy tests
I mentioned earlier that just removing slow tests is not a great idea, but removing tests is fine if we are talking about legacy tests.
Different people probably have different takes on what a “legacy test” is, but here I’m talking about tests that exhibit symptoms of legacy code: (1) the intent of the test is unclear, (2) the test is complicated and hard to understand, (3) the test is getting duct-tape patched left and right.
For example, in this effort I actually deleted a large number of parameterized RSpec tests built on top of request specs. These are tests that hit public HTTP endpoints and validate requests and responses.
These tests were parameterized with a large CSV where each row represents a set of parameters and the expected response. That might not sound so bad, but everything, from creating test data to obtaining the expected golden result, went through the public HTTP API, which made these E2E tests in practice, ones that just happen to avoid web browsers.
Approaching parameterized tests that are supposed to test details, including corner cases, with the methods of E2E tests is just not a great combination of WHAT and HOW. On top of that, this test had convoluted test-specific logic to map a value in the CSV to a request object. This resulted in operational headaches, such as difficulty in comprehending what caused a test to fail. Whenever this test failed, developers routinely and mindlessly overwrote the expected values with the observed values, or in some cases simply skipped tests for unclear reasons. Put simply, this test had too many problems.
Those are motivation enough, but still, removing a test is difficult: a test is protecting something, so it is hard to ascertain that removing it is safe.
So we collected evidence to make this judgment call. First, how often is this test getting updated?
% git --no-pager log --pretty='format: %ad' --date='format:%Y' -- spec/requests/integration | sort | uniq -c
  86 2015
  50 2016
  34 2017
  20 2018
  14 2019
  12 2020
   6 2021
The pace of change was going down year by year. Upon further inspection, most of the changes in the most recent 3 years were unrelated to business logic: code formatting changes, feature flag additions/removals, and so on.

I can see that well-encapsulated business core logic might have tests that stay stable and unchanged, but the test we were considering removing sits at the parameterized-testing layer, so we would expect new test parameters to be added as new features are added. The last 3 years saw a lot of new feature development, so if the test wasn’t gaining new parameters, it’s safe to say it was dead rather than stable. I further verified that tests written for new features during this time were mostly written against new business core logic classes. From there, I concluded that the team had gotten better at encapsulating responsibilities, which reduced the need for parameterized tests at the E2E layer.

The final nail in the coffin was code coverage: removing this test cost only 0.2% of coverage, which told me that what it had been testing was largely covered by other, finer-grained tests. All of that analysis allowed me to delete this test.
The removal of this one test led to an 80-minute reduction in test execution time. We run RSpec tests with 16 concurrent jobs, so the saving on a wall-clock basis is 5 minutes.
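If you want to run a similar coverage comparison yourself, here is a minimal sketch, assuming SimpleCov is already set up for the suite. The directory names and the --exclude-pattern value are placeholders to adapt, and depending on your SimpleCov version the JSON key may be covered_percent rather than line:

# Run the suite twice, saving SimpleCov's coverage/ output each time:
#   bundle exec rspec                                                     -> save as coverage-before/
#   bundle exec rspec --exclude-pattern 'spec/requests/integration/**/*'  -> save as coverage-after/
require 'json'

# SimpleCov writes the overall percentage to coverage/.last_run.json after each run
before = JSON.parse(File.read('coverage-before/.last_run.json')).dig('result', 'line')
after  = JSON.parse(File.read('coverage-after/.last_run.json')).dig('result', 'line')
puts format('coverage lost by removing these tests: %.2f points', before - after)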
2. Educate the team about how to deal with flaky tests
Previously, the way we dealt with flaky tests, tests whose outcomes are unstable, was to simply rerun CI jobs until they passed.
We changed this and asked developers to share suspected flaky tests with the whole team and, to the extent possible, fix them as they found them. My QA team led by example, fixing flaky tests eagerly. This resulted in a mindset change in the team, who now see even a single flaky test as a problem.
Me being Mr. Flaky Tests
"This test might be flaky"
"I’m looking into this suspected flaky test"
"Looks like a flaky test"
Let us know when you find a flaky test, instead of just rerunning. Share it in this channel, then click “Rerun Workflow from Failed” in CircleCI, or ignore the flaky failure and force-merge if you have someone near you with that permission. Don’t use “Rerun Workflow from Start” or amend the commit + force-push, as that will cause the whole thing to be rerun.
You can just rerun flaky tests when you find them, but if you let this channel know, someone might go fix them.
We no longer overlook flaky tests, and CircleCI’s flaky tests page shows zero flaky tests.
Also, when you do rerun a CI job, we ask developers to rerun just the failed job rather than the whole workflow.
“Rerun Workflow from Failed” for rerunning a CircleCI job
3. Skip tests under certain circumstances
freee HR uses a monorepo that includes both the frontend and the backend code. Some pull requests touch only one of the two, so they don’t need to run the tests for the side they didn’t touch. Thus we decided to look at what changed in a pull request and skip test jobs accordingly.
This is a simple idea, but one has to be careful with the skip condition. As a matter of fact, a similar scheme adopted in another repository inadvertently skipped tests that should have been executed, and it led to a production failure.
To implement the skip condition, we need to understand the dependencies in the source code. For example, if the development is schema-driven, the schema is used by both the frontend and the backend, so if a schema file changes, both sets of tests need to run. Another example of the care that’s needed: “if the backend changes, run backend tests” isn’t the same thing as “if a change is frontend only, skip backend tests.” In other words, rather than defining when tests should run, we want to define when tests can be skipped. This keeps things safe as more files are added in the future.
Communicating that tests were skipped so that developers and reviewers are aware of it is also important to avoid a false sense of safety. To this end, we post a comment to a pull request indicating test jobs are skipped:
Pull request comment stating test jobs were skipped
For your reference, here is an example of a CircleCI configuration that (1) skips backend tests if a change is frontend only and (2) posts a pull request comment when test jobs are skipped.
skip_job_if_changes_only_in_frontend: &skip_job_if_changes_only_in_frontend
  name: Skip job if changes are only in frontend files
  command: |
    # Skip if not triggered from a PR
    if [ -z "$CIRCLE_PR_NUMBER" ]; then exit 0; fi

    ISSUE_COMMENT_MARKER="generated by $CIRCLE_JOB job"
    FRONTEND_FILE_PATTERN='^(\.eslintignore|\.eslintrc\.js|\.jest|\.storybook|front|package\.json|tsconfig\.json|webpack\.config|yarn\.lock)'

    # NOTE: $BASE_BRANCH isn't a built-in env var for CircleCI,
    # so you'd need to obtain it out of band, for example from the GitHub API
    if git diff --name-only $(git merge-base origin/$BASE_BRANCH HEAD) | grep -vP $FRONTEND_FILE_PATTERN; then
      echo 'There are changes other than frontend files. Continue job...'
    else
      echo 'There are no changes except frontend files. Skip job.'
      # Add a comment to the PR stating that the job was skipped.
      # In case of parallel job execution, we only want one comment, so do it only from the 1st container
      if [ "$CIRCLE_NODE_INDEX" -eq 0 ]; then
        # Don't let our comments overload the PR. Delete old comments first
        curl \
          -s \
          -X GET \
          -H "Authorization: token $GITHUB_TOKEN" \
          https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/issues/$CIRCLE_PR_NUMBER/comments \
          | jq '.[] | select(.body | test("'"$ISSUE_COMMENT_MARKER"'")) | .id' \
          | while read COMMENT_ID; do
              curl -s \
                -H "Authorization: token $GITHUB_TOKEN" \
                -X DELETE \
                https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/issues/comments/$COMMENT_ID
            done
        # Add a new comment
        curl \
          -s \
          -X POST \
          -H "Authorization: token $GITHUB_TOKEN" \
          https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/issues/$CIRCLE_PR_NUMBER/comments \
          -d '{"body":"The change was frontend only. Skipping <code>'$CIRCLE_JOB'</code> for '$CIRCLE_SHA1'<!-- '"$ISSUE_COMMENT_MARKER"' -->"}'
      fi
      circleci-agent step halt
    fi

jobs:
  rspec:
    steps:
      - attach_workspace:
          at: ~/
      - run:
          <<: *skip_job_if_changes_only_in_frontend
      ...
4. Fix slow tests on a file-by-file basis
Here at freee HR, our backend tests use CircleCI’s parallel test execution support to run tests 16 ways in parallel.
CircleCI test parallelization optimizes for reducing job execution time and automatically assigns tests to create even buckets. In the case of RSpec, the unit of assignment is a file, which means that if some test files are outsized, they become the bottleneck and more parallelization won’t improve the speed.
Here’s an example of an RSpec job execution showing an uneven split before we took on this effort, from October 2021.
Execution times of containers are uneven
The 16 parallel containers took 272 minutes in total. An optimal split would result in 272 / 16 = 17 minutes per container, but in reality the job took 33.8 minutes. That’s twice the optimal time, so the unevenness is high: containers that ended up with long test files dragged things down.
Further analysis revealed that no single test file takes more than 17 minutes, so I do think the bin-split algorithm is suboptimal here. My perception is that CircleCI doesn’t do an optimal job when tests have a large variance in their execution times.
Here’s an example of a bin split after we evened out the tests across files, from October 2022.
Visible improvement to the evenness of the container execution times
The 16 parallel containers took 138 minutes in total. The optimal split is thus 138 / 16 = 8.6 minutes, and the actual number was 9.5 minutes. Comparing before and after, the total execution time went down drastically, but separately from that, the bins are now a lot more even, so we can further increase the parallelism to reduce the test execution time.
So slow test files are a problem. But this raises the question: how do we find them, and how do we improve them?
Find slow test files
First, we have to find the slow test files. At freee HR, we use rspec_junit_formatter as an RSpec reporter, so we can parse the resulting XML files and sum up test execution times by file.
Here’s an example of a report:
<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="RSpec" tests="1221" failures="0" errors="0" time="532.492096" timestamp="2022-03-25T18:05:01+09:00">
  <!-- Randomized with seed 12467 -->
  <properties/>
  <testcase classname="spec.models.break_setting_spec" name="BreakSetting#valid? when break_type is empty is expected to eq ["Break type has not been entered."]" file="./spec/models/break_setting_spec.rb" time="0.021405"/>
  <testcase classname="spec.models.break_setting_spec" name="BreakSetting#valid? when break_type is a valid value is expected to eq []" file="./spec/models/break_setting_spec.rb" time="0.001795"/>
  ...
</testsuite>
Using irb, we can read the XML files, sum up the execution time per file, and sort the result:

require 'rexml'

# Sum up test execution time per file across all JUnit XML reports
result = Dir.glob('/Users/kato-kei/Downloads/RSpec*.xml')
            .map { |path|
              doc = REXML::Document.new(File.new(path))
              REXML::XPath.match(doc, '/testsuite/testcase')
                          .group_by { |el| el['file'] }
                          .map { |file, cases| [file, cases.sum { |c| c['time'].to_f }] }
                          .to_h
            }
            .reduce { |acc, per_file| acc.merge(per_file) { |_file, a, b| a + b } }
            .sort_by { |_file, time| time }
            .to_h
pp result
And that led us to the slow test files:
1{"./spec/<test file 1>"=>1.3e-05,
2 "./spec/<test file 2>"=>0.00608,
3 ...
4 "./spec/<test file 7>"=>41.653222,
5 "./spec/<test file 8>"=>143.082863,
6 "./spec/<test file 9>"=>454.253146}
Improve slow test files
Based on my own experience fixing multiple such files, here are the three main reasons a single test file takes too much time:
1. Tests of different granularity are mixed together in a test file, such as model specs vs. request specs.
A model spec is an example of a layer in which finer-grained unit tests can be comfortably written (though the Ruby on Rails model layer is notorious for accumulating all sorts of responsibilities, so this is not always the case), so a single test case carrying just one expectation normally doesn’t hurt performance. A request spec, on the other hand, is an example of a coarse-grained test layer, where having just one expectation per test case is a luxury. When you have multiple expectations against the same input condition, you can gain performance by combining them into one test case, as the sketch below shows.
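Here is a minimal sketch of that combination, assuming a Rails request spec; the /api/employees endpoint, the response shape, and the count are invented for illustration:

# Before: two examples, two HTTP requests against the same endpoint
RSpec.describe 'GET /api/employees', type: :request do
  it 'returns HTTP 200' do
    get '/api/employees'
    expect(response).to have_http_status(200)
  end

  it 'returns the list of employees' do
    get '/api/employees'
    expect(response.parsed_body.size).to eq(3)
  end
end

# After: one example, one HTTP request, both expectations.
# aggregate_failures reports every failed expectation, not just the first.
RSpec.describe 'GET /api/employees', type: :request do
  it 'returns HTTP 200 with the list of employees' do
    get '/api/employees'
    aggregate_failures do
      expect(response).to have_http_status(200)
      expect(response.parsed_body.size).to eq(3)
    end
  end
end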
2. Too much unnecessary test data for a test case.
The RSpec DSL is powerful: hierarchical conditional contexts, test case extraction and reuse. When too many of those are used and the structure becomes too complicated, developers lose track of what’s going on overall and end up creating lots of unnecessary test data. Here is a concrete example. A test for an Active Record scope (a feature that assigns names to query conditions) was creating 100 rows in the DB as test data. The problem is that only a few test cases actually needed those 100 rows, but the data creation happened in a global before hook, and the file contained more than 200 test cases! By limiting the scope of this large test data creation, we gained performance; the sketch after this paragraph illustrates the move.
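A minimal sketch of that fix, assuming FactoryBot and ActiveSupport; the Employee model, the :employee factory, and the scope are invented for illustration:

RSpec.describe Employee do
  # Before: this hook sat at the top of the file and ran for all 200+ examples:
  #   before { create_list(:employee, 100) }

  describe '.recently_hired' do
    # After: the expensive setup is scoped to the only context that needs it
    before { create_list(:employee, 100) }

    it 'returns only employees hired within the last 30 days' do
      expect(Employee.recently_hired).to all(
        satisfy { |e| e.hired_on >= 30.days.ago.to_date }
      )
    end
  end

  describe '#full_name' do
    # No bulk data needed here; this example now runs against an empty table
    it 'joins first and last name' do
      employee = build(:employee, first_name: 'Kei', last_name: 'Kato')
      expect(employee.full_name).to eq('Kei Kato')
    end
  end
end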
3. There are simply too many test cases.
One reason this happens is that, like in the second case above, a test file becomes so big and complex that people unintentionally end up running tons of test cases. Another cause is the combinatorial explosion of cases when independent axes are multiplied together. The worst is when these problems happen in request spec tests, so layered contexts and shared_examples_for & it_behaves_like need particular attention (see the sketch below). By narrowing down what actually needs to be tested, we can gain performance.
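To make the multiplication concrete, here is a structural sketch; the endpoint, months, and roles are invented, and the example bodies are stubbed out:

# 3 months x 2 roles x 3 shared cases = 18 request-spec examples
# generated from a few lines of DSL, and every one of them is slow.
RSpec.shared_examples 'a guarded endpoint' do |role|
  it 'rejects anonymous users' do
    skip 'illustration only'
  end

  it 'rejects suspended users' do
    skip 'illustration only'
  end

  it "accepts an active #{role}" do
    skip 'illustration only'
  end
end

RSpec.describe 'GET /api/payrolls' do
  %w[2022-01 2022-02 2022-03].each do |month|         # 3 months
    context "for month #{month}" do
      %w[admin member].each do |role|                 # x 2 roles
        context "as #{role}" do
          it_behaves_like 'a guarded endpoint', role  # x 3 shared examples
        end
      end
    end
  end
end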
5. Optimize common routines that are used by all the tests
In RSpec, it is common to use spec_helper.rb or the like to define common before/after hooks for all test cases. The cost of such a hook is multiplied by the number of test cases, so even a small overhead can have a huge impact.
For example, at freee HR we have about 20,000 test cases. If a common hook takes 100ms to run, the total overhead is 2,000 seconds, about 33 minutes. Even in a 16-way parallel execution environment, that’s 2 minutes of wall-clock time.
Take a look at your own common hooks. You just might find something surprising.
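As a starting point, here is a minimal sketch, assuming the common setup lives in an RSpec.configure before hook in spec_helper.rb; the global accounting variable is ad hoc, purely for investigation:

# spec_helper.rb (investigation only): time the common hook across the suite
RSpec.configure do |config|
  $common_hook_cost = 0.0

  config.before(:each) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # ... the existing common setup goes here ...
    $common_hook_cost += Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  config.after(:suite) do
    puts format('common before-hook total: %.1f seconds', $common_hook_cost)
  end
end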
Cost reduction that improves developer experience
What actually triggered this effort to reduce test execution time was a business push to cut costs, in particular the SaaS costs of development.
When you hear “cost cutting,” you might think of something negative, like your boss breathing down your neck. But I don’t believe in taking something that’s producing value and reducing its cost along with its value; that’s giving up on engineering. Instead, look for the costs that “shouldn’t be there,” costs that don’t lead to value, and you will find them right in front of you. Eliminating those not only saves money but also improves your developer experience and quality.
If that’s too much mumbo jumbo, just think of reducing automated test execution time as a challenge with a nice, simple cause and effect and a measurable output. It’s a perfect subject for engineering. Take it on to flex your muscles, to change your mood, to solve a puzzle.