20050308

Tales of CatastrophicFailover

Martin Fowler's blog post on CatastrophicFailover reminded me of a classic example.

I was working on a military command system project about 10 years ago. The project was in what would now be called the 'systems shakedown' phase, which involved doing lots of things that the local population complained about: great fun, but a nauseatingly slow project. After 4 years and 200 developers working shifts, we had a reasonably good system. The core was in Ada (draconianly typed, excellent in other ways), the interfaces to hardware were in C and C++, and the system consisted of 200 nodes, each with several processors and truckloads of dual redundancy to resist physical damage.

One fine day in the main operations room, something happened. One by one, each of the control consoles in the room rebooted; you could see that it was happening in sequence, but at a very high rate. Ada is a very robust language: the strong typing means that runtime exceptions should never happen, so in this context the default exception handler is normally set to reboot the node. This had never happened in the simulated 'facility'.

Obviously, what was happening was that one console was being rebooted by an exception, but as part of its handover it was propagating the cause to its failover node, which was in turn tripped up by the same problem. The system used core memory replication to implement lossless hot failover: very effective, but prone to exactly this failure mode. Just to make matters worse, the error could only be reproduced (or so we thought) during periods of high activity, so the background logging thread never had time to record the state of the node and the sophisticated replay system was useless.

Step one was to avert the domino effect, so extra exception handling was temporarily added to stop the exception from causing the meltdown. Then extra logging was added to try to discover the source.
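The guard looked very roughly like the sketch below, written here in Java rather than the original Ada, and with made-up names (TargetTrack, processTrack). The point is simply to catch and log the failure locally instead of letting the default handler reboot the node and hand the same poisoned state to its failover twin.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    class TargetTrack { /* placeholder for the real track data */ }

    class TrackProcessor {
        private static final Logger LOGGER = Logger.getLogger(TrackProcessor.class.getName());

        // Temporary guard: catch the runtime failure locally, record as much as
        // possible, and carry on, rather than letting the default handler reboot
        // the node and pass the same poisoned state to its failover twin.
        void safeProcess(TargetTrack track) {
            try {
                processTrack(track);
            } catch (RuntimeException e) {
                LOGGER.log(Level.SEVERE, "Dropped track after failure: " + track, e);
            }
        }

        void processTrack(TargetTrack track) {
            // ... the real processing that occasionally blew up ...
        }
    }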

It did not take long to find the source, or at least the line of code: the culprit was the read of a variable that represented the direction of a target. But this variable had been set in another node, by another part of the software, which had in turn read it from a hardware interface. Why, then, did it work in 99.999% of cases, yet trip the exception about once a day under heavy load? Heads were scratched for weeks, until one day I was in the 'facility' at a strange time of day doing some very 'quiet' testing of systems and (as we were still running the old build) the consoles rebooted in sequence. The system was under low load that day, and I noticed that one of the targets had just passed north...

The next day I went back to the nice warm development site and started ploughing through the source of the relevant sections of code, tracing the variable back through the layers of software to its hardware origins to see why north might be significant. I discovered that the variable was being converted back and forth between radians and degrees in several different layers, but this did not answer my question. Then I took a close look at the value of PI being used, and in the C code the value was specified to one more decimal place than the Ada definition. And that was the answer: the numerical accuracy of the C conversion was slightly different from that of the Ada, so in rare circumstances the Ada code could be given a direction of more than 360.0 degrees and fall over in a heap.
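To make the effect concrete, here is a small Java sketch of the same mistake. The real code was C and Ada, and the real PI values differed only in the last decimal place, which is presumably why it needed a target within a whisker of north and took weeks to show itself; the constants below are exaggerated so the overflow is visible at a coarser bearing, and all the names are invented for the example.

    // Illustrative only: two layers that disagree about the value of PI.
    public class BearingRoundTrip {

        static final double PI_HARDWARE = 3.14159265358979;  // the more precise definition
        static final double PI_COMMAND  = 3.1415926;         // the shorter one (exaggerated)

        // The hardware-facing layer converts the direction to radians with its PI...
        static double toRadians(double degrees) {
            return degrees * PI_HARDWARE / 180.0;
        }

        // ...and a higher layer converts it back to degrees with a slightly different PI.
        static double toDegrees(double radians) {
            return radians * 180.0 / PI_COMMAND;
        }

        public static void main(String[] args) {
            double bearing = 359.999999;                      // a target just short of north
            double roundTripped = toDegrees(toRadians(bearing));
            System.out.println(roundTripped);                 // slightly more than 360.0
            if (roundTripped > 360.0) {
                // In the Ada this presumably tripped a range check (Constraint_Error),
                // which the default handler turned into a reboot.
                throw new IllegalStateException("direction out of range: " + roundTripped);
            }
        }
    }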

The reason I am writing this now is that I am in the process of integrating an enterprise-standard Java calculation engine with various J2EE-based platforms, and despite so-called standardisation, this problem of botched numerical accuracy still exists in code today. At least Java's exception handling conventions are more forgiving.
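Even in pure Java, with no mismatched constants at all, converting a bearing to radians and back is not guaranteed to give you exactly the value you started with. A trivial check, nothing to do with any particular engine, just the general round-trip problem:

    // Illustrative only: counts how many bearings (to 4 decimal places) do not
    // survive a degrees -> radians -> degrees round trip through java.lang.Math.
    public class RoundTripCheck {
        public static void main(String[] args) {
            int drifted = 0;
            for (int i = 0; i < 3600000; i++) {
                double degrees = i / 10000.0;                 // 0.0000 up to 359.9999
                double back = Math.toDegrees(Math.toRadians(degrees));
                if (back != degrees) {
                    drifted++;
                }
            }
            System.out.println(drifted + " of 3600000 bearings did not survive the round trip");
            // Anything feeding such values into a strictly range-checked interface
            // should normalise them first, e.g. ((back % 360.0) + 360.0) % 360.0.
        }
    }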
