A collection of articles, ideas, and rambling from a guy who wrote some software that one time.

Sunday, February 01, 2009

The Joel Un-test

Joel Spolsky seems to like controversy, although I can see why.  Being a contrarian ideologue is pretty sweet.

Some people have been suggesting that the Joel Test should mention "100% unit test coverage" as well.  Personally, I think that's a great idea.  The industry is evolving, and automated testing is getting into the suite of tools that every competent programmer should be familiar with.

Joel disagrees, since 100% coverage is "a bit too doctrinaire about something you may not need".

For what it's worth, I don't completely disagree with Joel.  Some of the software that I work on doesn't have 100% test coverage, and that's okay.  I wrote it before I learned about unit testing.  I'm not freaking out and spending all of my time just writing tests for old code which apparently works.

However, we do have policies in place to add test coverage whenever we change anything.  Those policies stipulate that 100% coverage is a requirement for any new or changed code, so I consider myself a fan of 100% coverage and I generally think it's a good idea.  I do think it belongs on the Joel Test, or at least something like it.

I feel like my opinions are representative of a pretty substantial number of "agile" practitioners out there, so I'd just like to respond to a few points:

Joel mentions "SOLID principles", as if they're somehow equivalent to unit testing.  As if the sentiment that leads one to consider 100% test coverage a great idea leads one into a slavish architectural death-spiral, where any amount of "principles" are dogma if they have a sticker that says "agile" stuck to them.

Let me be clear.  I think that SOLID, at least as Joel's defined it, is pointlessly restrictive.  (I've never heard about it before.)  As a guy who spends a lot of time implementing complex state machines to parse protocols, I find "a class should have only one reason to change" a gallingly naive fantasy.  Most of the things that Joel says about SOLID are true, especially if you're using a programming language that forces you to declare types all over the place for everything.  (In Python, you get the "open" part of "OCP", and the "clients aren't forced" part of "ISP" for free.)  It does sound, in many ways, like the opposite of "agile".

So, since SOLID and unit testing are completely unrelated, I think we can abandon that part of the argument.  I can't think of anyone I know who likes unit testing would demand slavish adherence to those principles.  I agree that it sounds like it came from "somebody that has not written a lot of code, frankly".

On the other hand, Joel's opinion about unit tests sounds like it comes from someone who has not written a lot of tests, frankly.

He goes on and on about how the real measure of quality is whether your code is providing value to customers, and sure you can use unit tests if that's working for you, but hey, your code probably works anyway.

It's a pretty weaselly argument, and I think he knew it, because he kept saying how he was going to get flamed.  Well, Mr. Spolsky, here at least that prediction has come true ;-).

It's weaselly because any rule on the Joel Test could be subjected to this sort of false equivalence.  For example, let's apply one of his arguments against "100%" unit testing to apply to something that is already on the joel test, version control:

But the real problem with version control as I've discovered is that the type of changes that you tend to make as code evolves tend to sometimes cause conflicts. Sometimes you will make a change to your code that, causes a conflict with someone else's changes. Intentionally. Because you've changed the design of something... you've moved a menu, and now any other developer's changes that relied on that menu being there... the menu is now elsewhere. And so all those files now conflict. And you have to be able to go in and resolve all those conflicts to reflect the new reality of the code.

This sounds really silly to anyone who has really used version control for any length of time.  Sure, sometimes you can get conflicts.  The whole point of a version control system is that you have tools to resolve those conflicts, to record your changes, and so on.

The same applies to unit tests.  You get failures, but you have tools to deal with the failures.  Sure, sometimes you get test failures that you knew about in advance.  Great!  Now, instead of having a vague intuition about what code you've broken intentionally, you actually have some empirical evidence that you've only broken a certain portion of your test suite.  And sure, now you have to delete some old tests and write some new tests.  But, uh... aren't you deleting your old code, and writing some new code?  If you're so concerned about throwing away tests, why aren't you concerned about throwing away the code that the tests are testing?

The reason you don't want to shoot for 90% test coverage is the same reason you don't want to shoot for putting 90% of your code into version control or automating 90% of your build process into one step or putting 90% or (etc) is that you don't know where the bugs are going to crop up in your code.  After all, if we knew where the bugs were, why would we write any tests at all?  We'd just go to where the bugs are and get rid of them!

If you test 90% of your code, inevitably, the bugs will be in the 10% that you didn't test.  If you automate 90% of your build, inevitably the remaining non-automated 10% will cause the most problems.  Let's say getting the optimization options right on one particular C file is really hard.  Wouldn't it be easier to just copy the .o file over from bob's machine every time you need to link the whole system, rather than encoding those options in some kind of big fancy build process, that you'd just have to maintain, and maybe change later?

Joel goes on to make the argument that, if he were writing some software that "really needed" to be bulletproof, he'd write lots of integration tests that exercised the entire system at once to prove that it produced valid output.  That is a valid testing strategy, but it sort of misses the point of "unit" tests.

The point of unit tests — although I'll have to write more on this later, since it's a large and subtle topic — is to verify that your components work as expected before you integrate them.  This is because it's easier to spot bugs the sooner you find them: the same argument Joel makes for writing specs.  And in fact if you read Mr. Spolsky's argument for writing specs, it can very easily be converted into an argument for unit testing:

Why won't people write unit tests? People like Joel Spolksy claim that it's because they're saving time by skipping the test-writing phase. They act as if test-writing was a luxury reserved for NASA space shuttle engineers, or people who work for giant, established insurance companies. Balderdash. ... They write bad code and produce shoddy software, and they threaten their projects by taking giant risks which are completely uncalled for.

You think your simple little function that just splits a URL into four parts is super simple and doesn't need tests because it's never going to have bugs that mysteriously interact with other parts of the system, causing you a week of debugging headaches?  WRONG.  Do you think it was a coincidence that I could find a link to the exact code that Joel mentions?  No, it's not, because any component common enough to make someone think that it's so simple that it couldn't possibly have bugs in it, is also common enough that there are a zillion implementations of it with a zillion bugs to match.

Unlike specs, which just let you find bugs earlier, tests also help you make finding (and fixing) a bug later be cheaper.

Watching a test-driven developer work can be pretty boring.  We write a test.  We watch it fail.  We make it pass.  We check it in.  Then we write another test.  After a while of watching this, a manager will get itchy and say, Jeez!  Why can't you just go faster!  Stop writing all these darn tests already!  Just write the code!  We have a deadline!

The thing that the manager hasn't noticed here is that every ten cycles or so, something different happens.  We write a test.  It succeeds.  Wait, what?  Oops!  Looks like the system didn't behave like we expected!  Or, the test is failing at a weird way, before it gets to the point where we expect it to fail.  At this point, we have just taken five minutes to write a test which has saved us four hours of debugging time.  If you accept my estimate, that's 10 tests × 5 minutes, which is almost an hour, to save 4 hours.  Of course it's not always four hours; sometimes it's a minute, sometimes it's a week.

If you're not paying attention, this was just a little blip.  The test failed twice, rather than once.  So what?  It's not like you wouldn't have caught that error eventually anyway!

Of course, nobody's perfect, so sometimes we make a mistake anyway and it slips through to production, and we need to diagnose and fix it later.  The big difference is that, if we have 100% test coverage, we already have a very good idea of where the bug isn't.  And, when we start to track it down, we have a huge library of test utilities that we can use to produce different system configurations.  A test harness gives us a way to iterate extremely rapidly to create a test that fails, rather than spinning up the whole giant system and entering a bunch of user input for every attempt at a fix.

This is the reason you don't just write giant integration tests first.  If you've got a test that just tells you "COMPILE FAILED", you don't know anything useful yet.  You don't know which component is broken, and you don't know why.  Individual unit tests with individual failures mean that you know what has gone wrong.  Individual tests also mean that you know that each component works individually before inserting it into your giant complex integrated compiler, so that if it dies you have a consistent object that you know at least performs some operations correctly, which you can inspect and almost always see in a sane internal state, even if it's not what the rest of the system expects.

Giant integration test suites can be hugely helpful on some projects, but they are the things which are sometimes unnecessary gold plating unless you have a clear specification for the entire system.  Unit tests are the bedrock of any automated testing strategy; you need to start there.

Unit tests seem like they take time, because you look at the time spent on a project and you see the time you spent writing the tests, and you think, "why don't I just take that part out?".  Then your schedule magically gets shorter on paper and everything looks rosy.

You can do that to anything.  Take your build automation out of your schedule!  Take your version-control server out of your budget!  Don't write a spec, just start coding!  The fact is, we pay for these tools in money and time because they all pay off very quickly.

For the most part, if you don't apply them consistently and completely, their benefits can quickly evaporate while leaving their costs in place.  Again, you can try this incomplete application with anything.  Automate the build, but only the compile, not the installer.  Use version control, but make uncommitted hand-crafted changes to your releases after exporting them.  Ignore your spec, and don't update it.

So put "100% test coverage" on your personal copy of the Joel Test.  You'll be glad you did.

One postscript I feel obliged to add here: like any tool, unit tests can be used well and used poorly.  Just like you can write bad, hard-to-maintain code, you can write bad, hard-to-maintain tests.  Doing it well and getting the maximum benefit for the minimum cost is a subtle art.  Of course, getting the most out of your version control system or written spec is also a balancing act, but unit tests are a bit trickier than most of these areas, and it requires skill to get good at them.  It's definitely worth acquiring that skill, but the learning is not free.  The one place that unit tests can take up more time than they save is when you need to learn some new subtlety of how to write them properly.  If your developers are even halfway decent, though, this learning period will be shorter than you think.  Training and pair-programming with advanced test driven developers can help accelerate the process, too.  So, I stand by what I said above, but there is no silver bullet.

13 comments:

Andrew Dalke said...

Here's one I've been thinking about during the last week. I wrote a MySQL extension as a user-defined function written in C++. This is a shared library which is run-time loaded into the MySQL server.

My test suite, writing in Python using MySQLdb, has near 100% coverage of the C++ code. The only thing I couldn't test were the 1 malloc and 4 new failures, which should only happen when the machine runs out of memory. The code is essentially:

if (allocated_space == NULL) {
strcpy(message, "this_function: out of memory");
return 1;
}

How could I write a coverage test for this? I can test it manually by changing the "==" to a "!=" to make sure the code path executes correctly, but that's not automated.

I could write something which replaces malloc and injects failures at the right place, but then I would have to use it for all of mysqld or I would have to make it part of a stub driver which loads the shared library and emulates the proper calling sequence.

I could do the latter, but it would take time and I would have to integrate the new code into the otherwise pure Python-based test system. The entire project took 6 days. If getting 100% coverage requires another day of work then the cost to my client goes up by 15%, all to check an additional 10 lines of code. As it is, I have about a line count of about 2,000 and roughly 1,000 logical lines of code, so my coverage is 99%.

Yes, I know about projects like SQLite where they wrote code to implement strange file-system and memory allocation errors, but the cost for doing that for this project, for a very unlikely occurrence, when there is no existing framework, does not seem justifiable.

Should I have just not tested for NULL? But I want my shared library to be a good citizen inside the database and not be the direct cause of a server crash. In the unlikely case there is a memory problem, at least there will be something in the log that helps point to the crash site, and MySQL has its own reporting mechanism for out-of-memory cases. It's possible to use their "my_malloc" as a replacement for "malloc", in which case I would be able to use that mechanism, but that doesn't address my four new calls.

By the way, I followed the link to your policies (with the wiki title UltimateQualityDevelopmentSystem) but saw nothing about code coverage.

glyph said...

@Andrew, that's a very compelling counterexample, but I feel like it's the exception that proves the rule. (The rule being "if you don't have tests for everything you work in a bad place.")

Sometimes one can find oneself in an environment where it's so hard to write tests that the write decision is to not write the tests. But, this applies to the other parts of the Joel Test as well; one can find oneself in an environment where it is so hard to use version control or a bug-tracking system that the right decision is not to use them. That doesn't make them good places to work :).

Even in reasonable places, occasionally an exceptional situation arises. For example, you might be at a place that uses version control, but be faced with a file containing a gigabyte of binary data which would be inadvisable to check in to the trunk of the repository. I don't think that we should say that is a good reason to have "you should evaluate each file to see if you should check it in" as a rule of thumb for using version control.

You might be right about this test being so hard to write. If so, MySQL and C++ are bad places to work, because testing there is unnecessarily difficult.

Maybe they're not that bad though. Consider this snippet of MySQL documentation which describes how to write tests for plugins.

My personal suggestion, in your exact scenario, (assuming that the linked document doesn't help) would be to not test for NULL, but to file a ticket and/or put an "XXX" in the code, explaining that this should probably be dealt with. Error-reporting code is actually the most important to test, because it is so infrequently executed. A very, very common error I've found is that the code is initially correct, but the initialization of 'message' gets shuffled around so that 'strcpy' is no longer valid; it still builds, it still seems like you're protected against NULL returns, but now your error detection code is actually going to make diagnosing the real problem harder, because your coredump is going to have you failing on "message" rather than "allocated_space".

Andrew Dalke said...

In this case the right decision was to put the code into MySQL. The goal is to support queries that include a bit of chemistry-oriented functionality (does the compound contain an aromatic ring? how similar is this compound to another compound?) when searching a molecular compound database.

Queries also include expressions like "only compounds from this supplier" and "molecular weight must be between X and Y", which are straight SQL calls, so the ability to combine those with the chemistry functions is very useful. In addition, the clients are in Java and Python, which is fine because both can speak to MySQL. Finally, scalability isn't a concern because there's at most 200 registered users. This really is a niche domain.

I think there's a misunderstanding. It's not at all hard to write tests against my server extensions. That was straight-forward with MySQLdb and unittest. I reconnect to the server for each test and my total test time is still under a second. Using the link you pointed out would have been worse because I would have to code my tests in C instead of Python. This really isn't a difficult environment. The only thing I couldn't test was for "new" failures in C++ code.

The only difficulty was handling the out-of-memory errors in the test suite.

You say that if it's hard to write a test then a given environment is a bad place to work. What about the Python code base. Should the developers stop working in C? I know there are error code branches which aren't tested.

For that matter, take a look at unittest.py (and other stdlib modules). It contains an "except KeyboardInterrupt: raise". Do you think there are tests for that code path? How would you write those tests? Does that mean Python is a bad development environment? I notice that there are a few similar KeyboardInterrupt: raise handlers in the Twisted code. ;)

Were I to follow your suggestion, who do I file a ticket to? To MySQL saying there needs to be a way to test that case? To our bug tracker? If the latter, what's the resolution? Or is it meant as a reminder or later, in which case I would rather put it as a comment in the code.

I feel that it's easier to comment "this code branch is not tested by the test suite; if you make changes you must do manual tests." After all, the manual test really is to change the "== NULL" to a "!= NULL" and make sure the code gets executed correctly (correct message, and correct return code).

You commented "but the initialization of 'message' gets shuffled around". I know what you're saying but there's some internal details about plugins that's relevant here. The "message" buffer is allocated by MySQL (with a guaranteed size which is greater than 80 characters) and passed to the extension code as a parameter. I didn't allocate it and that reshuffling case you pointed out can't occur.

It does feel like the MySQL extension API was designed to minimize failures in error handling code as there really is very little that can go wrong with that code branch.

So in this case I'm happy with 99% code coverage, and don't see the cost-effectiveness, much less usefulness, of 100% coverage.

illume said...

hi,

you make many good points.


There is a limited amount of resources, so you need to follow a certain path.

eg, To spend a day writing another feature, or testing something 100%.

If you consider the feature that hasn't been added yet a whole bunch of failed tests, then deciding to work on 100% coverage for one module seems a bit silly.

As an aside, it's impossible to get 100% test coverage with python. Since you do not know what errors can be raised -- you can't test for them all. Also python uses many non-deterministic algorithms - so many times your code will not follow the exact same code paths.

For prototype code, or code that needs to be written within 1 day, or code that will only ever run once -- aiming for 100% tests slows you down. There's trade offs for all sorts of situations, and for many situations tests are just not needed, let alone 100% unittest coverage.

However for many cases, aiming for 100% test coverage is obviously very useful, however there's also many cases where it's not useful.

Thanks for your article, I enjoyed reading it... and it brought up many ideas.

Andrew Dalke said...

Illume - that's exactly one of the points Joel makes:


But, I feel like if a team really did have 100% code coverage of their unit tests, there'd be a couple of problems. One, they would have spent an awful lot of time writing unit tests, and they wouldn't necessarily be able to pay for that time in improved quality.


The Joel test says that if more than a couple factors in the yes/no list are not present then there are serious problems in the software development process. Glyph wants "100% code coverage" to be one of them, and less than 100% means that "you work in a bad place."

I am testing (the modern definition of "proving") that rule by a counter-example, to argue, as Joel wrote "100% unit tests of all your code ... strikes me as being just a little bit too doctrinaire about something that you may not need."

Thinking about it, under this definition I shouldn't have even checked the malloc return values, because the 10 minutes it took to write the code and check it by hand wasn't worth the time for a case that has a very low likelihood of actually occurring. In that way both Joel and Glyph would be happy. (perhaps?)

But the pedantic programmer in me says I must check malloc returns and report an error. Just like I do for every other function call.

mithrandi said...

Uhm, oops. Blogger: your UI sucks!

"The one place that unit tests can take up more time than they save is when you need to learn some new subtlety of how to write them properly."

The problem I often seem to find myself in is that I don't just need to learn "some new subtlety", but I have absolutely no freaking clue how to write tests for certain things; and I either have no luck finding anyone else that can explain to me how it's done, or I find out that doing it will require writing an entirely new testing framework / tool / whatever because nobody has already done it. Ouch.

Over a long period of time, I've managed to knock one or two items off the "no clue how to test this" list, but I'm still haunted by a lot of my code being testless.

glyph said...

@Andrew: You're asking a lot of good questions here, but I think to comprehensively answer them all here would strain the bounds of blogger's comment capabilities :). Therefore I apologize for a somewhat incomplete answer.

To briefly answer the most salient points, though:

You say that if it's hard to write a test then a given environment is a bad place to work. What about the Python code base.

I think my opinion of Python's test coverage is a matter of public record; you can find my posts on the python-dev mailing list. I respect the right of the python developers to make their own process decisions, but I don't believe that ignoring the buildbots, allowing untested commits, and not fixing failing tests has served them well. This is regardless of language choice. On the other hand, this process seems to serve them well enough; I use Python every day and it works well enough.

On the other hand, I personally know that many successful video game titles have been created by people who did not use version control :). Success is not defined by following some arbitrary set of guidelines.

My main point here is that the arguments Joel is using against 100% coverage are somewhat specious, not that there are no reasons to miss 100% coverage. As I keep saying, there are valid reasons not to use version control or write specifications 100% of the time too. These situations are usually a sign that something else is wrong (the version control system is insufficiently flexible, you don't have the budget for a decent spec writer, you're using an inflexible system which is difficult to test) but sometimes it's not worthwhile to fix the underlying problem.

Should the developers stop working in C?

Yes. I think Python should be written in Python. I hope that PyPy will have enough success that eventually CPython will become redundant. Of course, we should not blow up the world in the meanwhile; I am not saying "let's get rid of CPython right now because it doesn't have 100% coverage".

I notice that there are a few similar KeyboardInterrupt: raise handlers in the Twisted code. ;)

To my knowledge, those cases are covered by tests in Twisted. They should be covered by tests within Python :). Again, unit test coverage (raising the exception and making sure the code path does something like what's expected) is not the same as integration testing (actually sending a SIGINT to the process and watching its output).

Were I to follow your suggestion, who do I file a ticket to?

Your own bug tracker. And using a bug tracker for everything is Joel's rule, not mine ;-).

I don't believe that it's worth adding a branch of code that you're not going to test. If it's not worth testing that the server won't crash in this obscure condition because it's so unlikely, just go ahead and let the server crash in this obscure condition, since it's so unlikely :).

manuelg said...

I would write the "über-Joel" rule as:

"100% unit test coverage, unless you are willing to advertise to your customers exactly which sections of code only have 99% test coverage"

Frankly, if I was Andrew Dalke's customer, I would give him a pass on his memory run-time error in a MySQL extension. If Dr. Dalke has customers that are not as lenient, then he should make a choice between that particular customer's patronage and the effort to build a comprehensive test rig.

(I am sure Dr. Dalke will take my suggestion like he would any other bit of unsolicited advice from the Internet peanut gallery.)

Andrew Dalke said...

Dr. Dalke? I don't have a PhD.

Interestingly, I came to my client's site this morning and saw they had problems with their nightly regression tests. The disk became full so write() failed. A very rare event, but because Python checks for it, the traceback was helpful. There isn't a test for "disk is full" in Python, but that code path does executes for any write failures. It is a reminder that rare cases can happen.

So I will continue to test for malloc failures, report them as appropriate, and test that code path by hand. I do test the code, just not automatically.

The KeyboardInterrupt was a flaky example. I looked at twisted/conch/client/default.py and I see now that it's pretty old code ("xreadlines()"?). But I will say that replacing the KeyboardInterrupt on line 215 with KeyboardInterrupt2 does not change the error count when I do the "trial" regressions.

A better and more direct example would be to see where Twisted uses malloc, which is very rare. One example is

+++ twisted/protocols/_c_urlarg.c (working copy)
@@ -50,6 +50,7 @@
/* output = cStringIO() */
output = PycStringIO->NewOutput(length);
if (output == NULL) {
+ exit(1);
return NULL;
}
end = s + length;

You can see I added the exit(1) but it was never called by the unit tests. Again, this code is very old (2001-2004) and likely pre-the 100% coverage emphasis.

There's iocpsupport.pyx which is new code, and apparently in use. It has

res = <myOVERLAPPED *>PyMem_Malloc(sizeof(myOVERLAPPED))
if not res:
raise MemoryError

As far as I can tell, the tests never check that failure case, but I can't run the tests to check that out.

The file _epoll.pyx does a malloc (shouldn't that be PyMem_Malloc?) and does not check the return result.

I think that means 2-out-of-3 Twisted files agree with me ;)

glyph said...

@Andrew,

For what it's worth, Twisted fails most of the points on the Joel Test as well. All of them except for points 1 and 4 (and, I guess, 11). We do use source control and we do have a bug database, and you can't contribute to Twisted without writing code.

So, we're not perfect on the 100% test coverage thing, either. But we should be. The C modules in question should be fully tested. If anything the standards on testing C modules should be tighter than on Python modules, because it's so easy to make a mistake. In fact, I'm pretty sure that one of the untested sections of _c_urlarg was at one point responsible for a pretty serious security issue (which is now fixed).

The problem I have with Joel's stance on this issue is that I think that his reasoning can be used to justify a "just test when it's easy" strategy. The value of tests goes up a lot as you hit 100% coverage. If you have a system with, say, 80% coverage, it's not like you can change 80% of the system with confidence. It's more like you can change any part of the system with 80% confidence, because you don't know what is connected to that untested 20%.

However, you're obviously right in that sometimes, lack of test coverage just doesn't matter. Sometimes the costs will outweigh the benefits. However, it's rarely the tests themselves which have introduced the cost. Any code which is flexible is easy to test. If you want to test what happens when the allocator gives you a NULL, all you need is a pluggable allocator. If your allocator is static and can't be modified, then yes, you have a problem: but that's a problem with the allocation API, not with unit tests. You can minimize untestable code and wrap the allocator in a function which you can then instruct to return NULL in your test suite.

Ultimately, if I were going to put a rule on the Joel Test, honestly, it would not be about coverage. It would be something like "Do you write tests before you write code". If you're doing TDD properly, 100% coverage is a side-effect.

Andrew Dalke said...

BTW, the Joel Test has "Do you use source control?" not "Is everything in source control?" But that's not what I wanted to say.

I was thinking more about your statement that "If you're doing TDD properly, 100% coverage is a side-effect." I don't think that's true.

Refactoring complicates things. Running the unittests against the refactored code only say that the existing tests didn't break. It doesn't say that there are new lines of code which aren't tested.

To be pedantic, one of Fowler's refactoring transformations is "Substitute Algorithm". The new algorithm may be clearer but have edge cases which weren't in the old algorithm, and therefore have code branches which are not covered by the old unittests.

Here's a trivial example, which is vaguely like Fowler's example for "Substitute Algorithm" only in reverse. I decided to rewrite:

int hex_to_decimal(char c) {
if ('0' <= c && c<='9') {return c-'0';}
if ('A' <= c && c<='F') {return c-'A'+10;}
return -1;
}

as (a hypothetically faster)

int hex_to_decimal(char c) {
switch(c) {
case '0': return 0;
case '1': return 1;
...
case 'E': return 14;
case 'F': return 15;
default: return -1;
}}

The original function was written with TDD but the original unit tests did not enumerate all possible hex values, because there wasn't a perceived need. It only checked the obvious edge cases and a couple of other cases to be on the safe side. Why check for '7' when it's obviously correct if '0' and '9' work?

The new function with the same unittests does not have 100% coverage.

This is a trivial example for a small amount of comment space - I'm sure you can come up with a better one.

As far as I can tell, this refactoring was done in full TDD style, yet it doesn't have 100% coverage at the end. I know how to get 100% coverage, but only through doing coverage tests, not through TDD.

(Inspecting the unittests to see if they suffice is a form of manual code coverage, and I think not part of TDD. I could add switch cases one-by-one and see if at least one additional test passes after each time, but that assumes very fine grain unittests with one letter per test. It is also very tedious.)

glyph said...

We should really do a panel at a conference or something :).

I don't think that the refactoring you described is good testing practice, but I think the point is very subtle. Especially since the popular literature says "red/green/refactor".

Personally I don't think that this is a good way to go. Refactorings should either have a new, interesting, testable property, or they are a waste of time. In other words my personal testing discipline is "red/green/red/green". I typically add a new test when I refactor, for whatever the interesting property is of the new factoring.

Your unit tests (presumably) began with a test for each code path. In your original function there are 3 codepaths. In the new function there are 17.

Minimal coverage would dictate that the original implementation would have at least 3 tests to cover each path. However, minimal coverage is not necessarily ideal coverage. I perceive that there would be a need to test all inputs :).

TDD treats the unit tests as an executable specification. The specification for a hex_to_decimal function would presumably include all hex digits.

Of course, this gets into some really tricky territory. Pretty much all the reasoning I've done about TDD treats arithmetic as a black box. In this specific case, I think I'm right, but for example, you can't practically unit-test an arbitrary-precision operator by testing the infinite range of addition operations that it can perform. So I can't say I've got all the answers :).

Andrew Dalke said...

I don't think I'm that good at panels. I don't think fast enough for verbal exchanges unless I've practiced what I want to say enough times. It took me several hours to come up with what I wrote here, including discussion with friends about the ramifications of 100% coverage. At the very least I would need to have the discussion topics beforehand so I can make notes. I won't be at PyCon, but I will be at EuroPyCon ;)

One of the people I talked to was Geoff Bache. He's the author of TextTest, which one of my clients uses. That testing system includes timing coverage for the tests. It shows a warning if a test becomes too slow or too fast. This is important because we do algorithms development, including code that runs for non-trivial time. (Tens of seconds to several minutes.)

Performance is a measurable properties, but not one handled by most testing tools meant for TDD.

Unittest doesn't handle it at all. Nose has a "@timed" decorator but that has a hard-coded time value. But in our code we have calculations which are dependent on other calculations, so doubling the performance of one low-level algorithm can have downstream effects on dozens of tests. TextTest lets us automate changing the base timing information when that happens.

Even in Kent Beck's "Test-driven development" he points out that they use JProbe to do coverage testing of their code, and full TDD still missed a case. (p86 and 87 are the only places which mention code coverage. He writes "TDD followed religiously should result in 100 percent statement coverage" immediately followed by a counter-example.)

So yes, I agree that there should be a measurable property, but I point out that the essential TDD practices don't include performance testing. That same section of Beck's book even says that TDD is not a replacement for performance testing.

I chose my hex example based an essay titled Two-stage tables for storing Unicode character properties. It in turn is commentary on a chapter in "Beautiful Code" which uses a series of switch statements for performance (Example 5-7 to be precise.) The "B.C." author then rewrites that as a large table lookup. The "two-stage" commentary rewrites that using a "two-stage table."

If I didn't know about two-stage tables, I might be tempted to autogenerate large switch statements to handle Unicode properties. I've definitely auto-generated binary decision trees in C, where there are easily 1000s of code paths. That's testable, but because I'll machine generate the tests, they are likely to fail in exactly the same way as the code.

Even converting code to tables is fallible, because that just moves state to some place invisible from coverage testing tools. There's 100% coverage, but that's insufficient.

And as you say, some input domains are just too large to cover exhaustively.

I think at this point we've also exhausted this topic. :)