OdeToCode IC Logo

Blackouts and race conditions

Sunday, April 11, 2004

Every time I write multithreaded code I worry. Perhaps it is because I read stories like the final report on the U.S. / Canadian blackout of August 2003. The problem was a race condition:

"There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time," says Unum. "And that corruption led to the alarm event application getting into an infinite loop and spinning."

Seems simple, but consider:

"We text exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug," says Unum. "I'm not sure that more testing would have revealed that“.

Stories like the above, and about the Therac-25 machine which killed three people, always make me worry I’m missing something. I’ve never worked on radiation therapy machines or power grid software, but years ago I did some development for a machine that would take your blood pressure using a real cuff, and I’ll admit to being apprehensive about sticking my arm in for the squeeze.

The other reason I worry is because it is nearly impossible to test for race conditions.

Functional testing and stress testing may catch race conditions with some luck, but these bugs are hard to bring to the surface. Black box testing in general, with no knowledge of the implementation, has little chance to uncover a race condition except as a side-effect.

Unit testing is also little help. By definition, unit testing focuses on a small and specific piece of code and there is no interaction from other areas of the software. In addition, the exact timing conditions required to trigger a race condition bug have little chance of happening during a single-run pass or fail test.

In the end, a thorough code analysis (more sets of eyeballs) is the best tool to uncover race conditions. I remember a recent jaybaz blog entry showing code I’ve seen and written many times now in .NET and I’ve never thought about the potential race condition there. Sigh. Just goes to show race conditions are the most subtle bugs I know of.