It’s something that irks me.
My observation has been that developers are resistant to accepting the challenge of battling the scourge of intermittently failing tests.
What is the reason for this resistance? It is a question that has intrigued me for years, and I’d like to explore it in a little more depth.
I suspect that part of the human impulse to resist addressing non-deterministic tests is the intuition that to accept such a challenge might lead to much time and effort expended for little result. If that is the case, believe me, I understand. Fixing this sort of problem is not for the faint of heart!
Having experienced the struggle of identifying what causes an intermittent failure, I know it isn’t easy to unearth the root cause.
In my experience there is a palpable resistance to Martin Fowler’s suggested approach of quarantining non-deterministic tests. On one level, this bewilders me. As Martin illustrates, if a test cannot be relied upon to pass or fail, it is worse than useless. It is infecting the whole test suite. So, the sane approach is to, at least temporarily, remove it from the test suite.
Whenever I have suggested this, I have been met with resistance. Developers claim that the test is useful, that it provides protection as part of the regression suite against bugs being introduced. I’m still searching for a way of countering this argument, which is clearly fallacious. As Martin quite rightly asserts, if a test cannot be relied upon to pass or fail against the same codebase, it is worse than useless and must be immediately quarantined!
Sure, there is extra effort involved in configuring the quarantining process. The use of RSpec tags is handy for this. Then there is the perceived risk of the team forgetting to fix the quarantined test. Again, it requires some effort to set up, but it is certainly possible to automate warnings to the team about tests that have been quarantined for too long as well as build pipelines that contain too many quarantined tests.
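For the record, here is a minimal sketch of how RSpec tags can drive a quarantining process. The `:quarantined` tag name, the `RUN_QUARANTINED` environment variable and the `OrderProcessor` class are my own inventions for illustration; only `filter_run_excluding` is RSpec’s actual API.

```ruby
# spec/spec_helper.rb
RSpec.configure do |config|
  # Skip quarantined examples in the normal build; a separate CI job
  # could set RUN_QUARANTINED to exercise the quarantine on its own.
  config.filter_run_excluding quarantined: true unless ENV['RUN_QUARANTINED']
end

# Tagging a flaky example (OrderProcessor is a made-up class):
RSpec.describe OrderProcessor do
  it 'sends a confirmation email', quarantined: true do
    # ...
  end
end
```

A build step could then count the tagged examples and warn the team when the quarantine grows too large, or when an example has been tagged for too long.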
Of course, another possible response to a troublesome non-deterministic test is to simply remove it from the suite.
This may sound radical, but let’s consider the situation from a cost/benefit perspective. If a test cannot be guaranteed to reliably pass or fail then it is clearly not providing much benefit. If it takes considerable effort to debug and still cannot be guaranteed to reliably pass or fail, what should the developers make of it? It has clearly already absorbed considerable cost. This leaves the question of potential benefit.
A related question is: how crucial is this test? If it is a vital part of the suite, it is appropriate to continue trying to solve the non-deterministic behaviour. If it is not, there seems little value in keeping the test within the suite.
People are naturally lazy. Why do something that requires effort unless you really have to or there is a clear benefit that will accrue to you?
If a developer is working on a pull request, pushes a commit and the resultant build on the CI server fails with an error that is obviously unrelated to their pull request, what’s the easiest thing to do? Click the button to rebuild, of course! The temptation to do this can be compelling, even when the developer appreciates that, in the slightly longer term, it may not be a helpful response for his or her colleagues.
I’m not sure of the best way of countering this. Appealing to the greater good?
Usually when a developer notices an intermittently failing test their focus is elsewhere. They may be working on a feature and notice that the build fails due to a non-deterministic test that is unrelated to their feature. Or they may be watching the master build in preparation for a deployment. Understandably the priority in these scenarios is to enable the feature to be merged or the deployment to go ahead.
The key here is to at least take some action to fix the non-determinism, even if it is to schedule some work to rectify the situation later. Unfortunately, in my experience, the tendency is often to ignore the intermittently failing test.
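To make this concrete, here is a contrived sketch of the kind of time-dependent test that fails intermittently. The `Subscription` class is invented for illustration:

```ruby
require 'date'

# A hypothetical class: a subscription renews one month after it starts.
class Subscription
  def initialize(start_date)
    @start_date = start_date
  end

  def renewal_date
    @start_date >> 1 # Date#>> adds a month, clamping to the month's end
  end
end

# Non-deterministic: an assertion like this holds on most days, but
# fails whenever the suite happens to run on, say, the 31st of a month.
# Subscription.new(Date.today).renewal_date.day == Date.today.day

# Deterministic: supply a fixed date rather than relying on Date.today.
renewal = Subscription.new(Date.new(2014, 1, 31)).renewal_date
# renewal is 2014-02-28 -- the clamping behaviour is now pinned down.
```

The fix is rarely this obvious in practice, but the pattern is common: any test that implicitly depends on the clock, the database state or execution order will pass on most runs and fail on a few.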
As Martin Fowler points out, there are many causes of non-deterministic tests. Among them are:

- lack of isolation between tests
- asynchronous behaviour
- interaction with remote services
- time dependencies
- resource leaks
Added to this, Keith Pitt has detailed 5 ways we’ve improved flakey test debugging, which focuses more on how to capture the database state when a test fails intermittently.
They are all worth reading. However, what I’m pondering in this article is more to do with motivation.
How can we best encourage developers to tackle non-deterministic tests?
Expecting the person who first notices the failing test to fix it is probably not a good approach. After all, that person is likely to be feeling frustrated or even angry that an unrelated test failure is holding up their progress.
A colleague of mine recently suggested assigning the task of fixing a non-deterministic test to the last person who changed that test. It’s a helpful suggestion that at least circumvents the frustration that I wrote about earlier. It also assumes that the team member assigned to fix the flaky test is willing and capable.
Of course, keeping a CI build healthy is a shared responsibility. The build is more likely to be healthy if all members of the team contribute to meeting the challenge posed by non-determinism.
In my case, one thing I need to be mindful of is to be careful not to let the frustration that I sometimes feel become counter-productive. As Kent Beck implied a while back, as well as working in small increments, it is important to be both kind and honest.
If I follow that advice hopefully I will respond to discovering non-deterministic tests by gradually finding ways to help the team to handle the challenge of fixing them more successfully.
Recently I responded to a request for help with setting up a new machine on which to develop Rails applications.
A little while later I was pleased to hear that Rosie Williams had completed setting up her new Mac and was ready to continue developing the Rails app that Matt Allen had helped her get started with at a Rails Girls event.
In our Twitter exchange, I also cheekily asked Rosie whether she had all her tests passing.
Now I’m not one who always practices strict TDD. I don’t always start with a failing test and work towards enabling that test to pass. Sometimes I do but other times I find more value in experimenting with a solution in other ways, e.g. via a REPL or a browser, before writing an automated test.
And there are times that I don’t bother with automated tests at all. However, those situations are definitely in the minority.
If I am working on a production system that an organisation is dependent on for more than a brief period of time, I consider automated tests to be vital.
Why? The simple answer is that software needs to be malleable.
Developers need to be able to make changes to a software system to meet changing needs of business. To allow a software system to adapt to changing requirements, the design needs to be continually improved. Integral to such refactoring is a sufficiently comprehensive suite of automated tests. Without such tests there will be too great a risk of new defects being introduced.
Recently I have been responding to various reports from users, all impacted by problems in a subsystem of a large Rails codebase.
It has been a classic case of defects caused by software that is far too complex for the essential job it was designed to do. To my eyes, it is begging to be refactored. As the code stands, debugging a problem can take a disproportionate amount of time because of the challenge of deciphering what the code is doing. It is clear that the internal design of this code needs to be improved so that it is easier for the next person to understand. So I intend to refactor this code in the near future.
Like many things, software is prone to entropy over time. Without careful maintenance, it is likely to degrade as bandaid solutions are progressively applied.
Refactoring is a key part of enabling software to successfully adapt. Automated tests are crucial to refactoring.
Unsurprisingly, there are differing views about TDD. Without going into depth here, I will allude to some interesting discussions between Martin Fowler, Kent Beck and David Heinemeier Hansson a few months ago, which I summarised in a talk I gave at a Sydney Ruby meetup in July.
I hope this post has been useful, particularly to newcomers to Rails development who may have been wondering what all the fuss is about TDD. There is much more to say about testing and refactoring but for now I’ll end this post by recommending:
One thing that the Rails community prides itself on is that test-driven development (TDD) is widely employed. But is this sufficient to produce good quality code that stands the test of changing requirements over a long period of time?
In my recent roundup of Ruby resources, Sandi Metz is one author I recommended. In her excellent talk at the 2009 Gotham Ruby Conference, Sandi contended that TDD is not enough. Sandi’s talk, which concentrates on applying SOLID design principles in Ruby, is well worth listening to in its entirety.
As Sandi explains, central to SOLID design is avoiding dependencies by striving for code that is:

- loosely coupled
- highly cohesive
- easily composable
- context independent
As she demonstrates applying these principles through refactoring a small Ruby application, Sandi shows that, if we want code to endure in a state that makes it fun to work with, we need to do more than use TDD. As Sandi refactors she emphasises that when we reach a point where the tests pass, we should not be satisfied. Nor should ensuring that the code is DRY be enough. Other questions that need to be asked about the class under test are:

- Does it have one responsibility?
- Does everything in it change at the same rate?
- Does it depend on things that change less often than it does?
Only when the answer to all of these questions is “yes” should we move on.
I know these are questions I should ask myself more often. As Sandi stresses, “test driven development is good, but it’s not enough.”
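To illustrate the coupling point with a small fragment of my own devising (the class names are hypothetical, not from Sandi’s talk):

```ruby
# Before: a Report that constructs its own formatter is tightly
# coupled to it -- changing the output format means changing Report.
class CsvFormatter
  def format(rows)
    rows.map(&:to_s).join(',')
  end
end

# After: the formatter is injected. Report is now loosely coupled and
# context independent -- any object that responds to #format will do,
# which also makes Report trivial to test in isolation.
class Report
  def initialize(formatter: CsvFormatter.new)
    @formatter = formatter
  end

  def render(rows)
    @formatter.format(rows)
  end
end

Report.new.render([1, 2, 3]) # => "1,2,3"
```

The refactoring is tiny, but it is the kind of move that a passing test suite alone will never prompt you to make.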
The fifth Australian Open Source Developers’ Conference was held last week in Sydney. In addition to helping organise the conference, I was fortunate enough to be one of the Ruby presenters.
Naturally, the slides from my talk were designed to assist my presentation rather than contain all the content. Indeed, some of the slides may raise questions on their own, so I thought it might be helpful to add some explanation here.
Each of us programmers is on a specific journey, especially when it comes to testing. Early in my professional career I was taught how to unit test program modules written in PL/I. However, the drivers for those tests had to be written in Job Control Language (JCL) – an experience I am not in a hurry to repeat.
Many years later, having escaped from working on mainframes, I discovered JUnit. This was a fundamental tool in my early experience of test driven development and refactoring. When I began exploring Ruby and Rails, I was eventually introduced to autotest, which I considered another quantum leap forward in the realm of automated testing.
In 25 minutes there was obviously a limit to the number of Ruby testing tools I could cover. So, having quickly explained the benefit of autotest and touched upon Test::Unit, I moved on to describe some tools that I have used in the last year.
To make sure the audience was still awake, at this point I showed a cute photo of our family dog. My lame excuse was that he exhibits a wide range of behaviours and RSpec is all about specifying behaviour. My main example of using RSpec was for specifying a controller. This led on to a brief digression into considering what makes a good test and the use of mock objects to isolate unit tests and make them faster by avoiding unnecessary database I/O.
I was pleased to be able to include Cucumber, Webrat and Selenium in my talk. It’s only fairly recently that I started using Cucumber in conjunction with Webrat or Selenium and I’m impressed. As Mike Koukoullis showed in his subsequent talk, developing with Cucumber is a very powerful approach, which fosters clear description of intent before development of any feature.
Speaking of other talks, Alister Scott used a lightning talk to share his enthusiasm for Watir, which looks like a good alternative to Selenium.
After briefly relating the motivation for developing alternatives to relying on fixtures for test data, I described Machinist, an elegant tool recently developed by Pete Yandell. When used in conjunction with Sham, Machinist provides a neat way of generating “blueprints” that can be used in Cucumber steps.
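Rather than reproduce Machinist’s actual API from memory, here is a toy, stdlib-only sketch of the idea behind Sham-style generators: uniquely-valued attributes produced on demand, so that test data never collides. `MiniSham` is my own invention, not part of either gem.

```ruby
# A toy imitation of the Sham concept: register a named generator
# once, then call it repeatedly to get fresh, unique values.
module MiniSham
  @counters = Hash.new(0)

  def self.define(name, &block)
    define_singleton_method(name) do
      @counters[name] += 1
      block.call(@counters[name])
    end
  end
end

MiniSham.define(:email) { |i| "person#{i}@example.com" }

MiniSham.email # => "person1@example.com"
MiniSham.email # => "person2@example.com"
```

Machinist’s blueprints build on the same idea: each `make` pulls unique values from the generators, which keeps Cucumber steps free of hand-maintained fixture data.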
To round out my talk, I thought it was important to offer a few philosophical thoughts. In a nutshell my view is that, whilst it is important to remember that automated testing is just one of many error removal approaches, we can all benefit from investing in automated testing.
In my case, as well as practicing using these tools, I’m also looking forward to reading the upcoming title The RSpec Book by David Chelimsky and others.
“And test everything. EVERYTHING!”, came the shout through the email thread.
The discussion had started with another developer’s assertion that you should “mock everything you’re not specifying” but quickly diverged when a third developer responded with the words “and if you don’t have time to test everything … prioritize”.
Before I relate my own humble contribution, let me step back and give some perspective to and explanation of the topic.
We are talking about testing software. In particular, we are talking about testing software in an automated fashion. My first encounter with automated testing was with JUnit back in 2001 when I was involved with Extreme Programming (XP) practices on several Java projects. I don’t intend to describe my experiences with XP here but, suffice to say, I found the main strength to be the emphasis on Test Driven Development. The idea is that you write a test for a unit of software first, fully expecting it to fail. Then make the test pass. Continuing along this path, you build up a suite of automated tests and ensure that they all pass before committing any change to the code-base. Later on, if a bug is discovered, what should you do? First, create a test that exposes the bug and then fix the bug so that the test passes.
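A tiny, hypothetical illustration of that bug-fixing discipline in plain Ruby; the `average` method and its bug are invented for the purpose:

```ruby
# The buggy version used integer division:
#   def average(prices)
#     prices.sum / prices.size   # average([1, 2]) returned 1, not 1.5
#   end
#
# Step one: write a test that exposes the bug (it fails against the
# code above). Step two: fix the code so the test passes.
def average(prices)
  return 0.0 if prices.empty?
  prices.sum.to_f / prices.size
end

# This assertion failed against the buggy version and now passes,
# and it stays in the suite to guard against regression.
raise 'regression!' unless average([1, 2]) == 1.5
```

The point is not the arithmetic, of course, but the habit: every fixed bug leaves behind a test that will catch its reappearance.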
Hopefully the uninitiated now have a better idea of automated software testing. Now, what does “mock everything you’re not specifying” mean?
Let’s deal with “specifying” first. This term relates to Behaviour Driven Development. As Dave Astels explains, “it means you specify the behaviour of your code ahead of time.” It’s a change of emphasis. I should point out that the discussion I’m referring to was amongst a group of Australian Ruby programmers. In Ruby, RSpec is the de facto Behaviour Driven Development framework. RSpec started out with an emphasis on specifications or “specs”. Later a “Story Runner” framework was added but this has been recently deprecated in favour of Aslak Hellesoy’s Cucumber, which will soon be incorporated into RSpec. But I digress. Essentially, a tool like RSpec gives the developer a syntax to specify how software should behave.
What is meant by the term “mock”? In software development mocks are “simulated objects that mimic the behavior of real objects in controlled ways.” If you’re interested in a more lengthy explanation, see Wikipedia’s entry – it’s actually quite informative. Essentially, mocks are useful for testing units of software in isolation.
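Here is a framework-free sketch of the idea; the `SignUp` and mailer names are invented. The real mailer would talk to an SMTP server, so the test substitutes a simulated object that merely records how it was called.

```ruby
# The code under test: sends a welcome email via a collaborator.
class SignUp
  def initialize(mailer)
    @mailer = mailer
  end

  def call(email)
    @mailer.deliver(to: email) # the side effect we want to observe, not perform
  end
end

# The mock: mimics the mailer's interface in a controlled way,
# recording deliveries instead of making them.
class MockMailer
  attr_reader :deliveries

  def initialize
    @deliveries = []
  end

  def deliver(to:)
    @deliveries << to
  end
end

mailer = MockMailer.new
SignUp.new(mailer).call('a@example.com')
mailer.deliveries # => ["a@example.com"]
```

Frameworks like RSpec generate such objects for you, but the principle is exactly this: the unit is exercised in isolation, and the test asserts on the interactions rather than on any real side effect.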
Are you still with me? Good. That’s the explanation over with. Now I want to return to the debate about whether or not a developer should test everything. As a developer, it is very easy to become zealous about automated testing once you’ve experienced success with it. Surely adding more tests and increasing coverage can only be good, can’t it? After all, more bugs will be found and fixed and the quality of the software will be easier to maintain over time.
I question this thinking. As I said when I added my two cents worth to the group discussion:
“Whilst it is a good thing to strive for better automated testing and greater test coverage, the effort required needs to be assessed in a business context. There are times when the effort cannot be justified.”
I know I want to improve my proficiency with automated testing tools. I’m pretty comfortable with RSpec but know that Cucumber, Selenium and jsunittest are all tools that I want to add to my testing toolbox.
But I have to balance the effort required to learn new tools and write automated tests with the benefit to be gained and the business imperatives. Long time software practitioner and researcher Robert L. Glass asserts the following in his book, Facts and Fallacies of Software Engineering:
“Even if 100 percent test coverage were possible, that is not a sufficient criterion for testing. Roughly 35 percent of software defects emerge from missing logic paths, and another 40 percent from the execution of a unique combination of logic paths. They will not be caught by 100 percent coverage.”
I agree with Glass when he says “producing successful, reliable software involves mixing and matching an all-too-variable number of error removal approaches, typically the more of them the better.” We should not be seduced by automated testing to the extent that we are prepared to pour endless resources into writing tests. Judgement is required. There should be a balance between automated tests (or specifications and scenarios) and destructive manual testing. When a deadline approaches, we should be prepared to go into “technical debt” as far as automated tests are concerned, as long as we commit to recovering that debt later.
So no, I don’t think we should always aim to have automated tests that cover all of our code.