Sunday, October 3, 2021

A Bug Story

A user reported an application crash that occurs when using a certain software component ("DataFile"), but only when *not* connected to the test hardware. The DataFile component has nothing to do with hardware - it provides a proxy for a data file and allows the user to access the floating-point data in that file via array indexing notation (with the actual file reading happening transparently and the data being cached).

So a problem report that correlated with a hardware connection (or the lack of one) was reminiscent of stories like the car that had trouble starting whenever the owner went to the store for *vanilla* ice cream, but not other flavours (http://www.snopes.com/autos/techno/icecream.asp). Or the password that didn't work if the user was standing up.

It turned out that the user was grossly misusing the DataFile component. It is intended for use with a file with N floats per line, separated by spaces or commas, etc. The user's file had one line with a string containing 30,000 0's and 1's, with no spaces or other delimiters. The low-level file reading and parsing is done via a standard Python library function. That function dutifully converted the string from the file into a float - but since it treated it as a float with 30,000 digits, the result was IEEE infinity. So my code got back an array with one entry - that infinity. And I had a bug in my error-checking code in the DataFile component (wasn't expecting infinity) that caused a Python exception.
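
The overflow is easy to reproduce. The snippet below is a minimal sketch, not the DataFile code: it assumes the parsing boils down to Python's built-in float(), and the digit string and the validate() helper are made up for illustration.

```python
import math

# Hypothetical stand-in for the user's file: one line of 30,000 digits,
# with no spaces or other delimiters.
line = "10" * 15000

# float() treats the whole line as one enormous number. It does not raise;
# the value overflows and comes back as IEEE infinity.
value = float(line)
print(value)               # inf
print(math.isinf(value))   # True


def validate(values):
    """Hypothetical check: reject NaN and infinity before the data is used."""
    for v in values:
        if not math.isfinite(v):
            raise ValueError(f"non-finite value in data file: {v!r}")
    return values
```

A check along the lines of math.isfinite() is one way to catch this kind of input early instead of letting an infinity wander into later code.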

So I fixed the error-checking code, added some more validation to the DataFile component to prevent users misusing it like this, and started to type up my bug fix report. Then I remembered the weirdness about this only happening when *not* connected to the hardware and investigated that. I have some code that tries to determine whether a hardware connection is necessary for what the user wants to do with our tool. Some of the components are purely computational but others "talk" to the hardware to make measurements. So I parse out the variable names that the user refers to in their program, and loop through the list of variable names, checking to see if any of them are components that need to talk to the hardware. That code did a lookup of the variable name followed by a check to see if the lookup succeeded. The check tested the object pointer from the lookup in the way that you might do it in C:

	if objPtr:
		doSomething()

That works fine in Python as well for testing for a null object pointer (None). The problem is that in Python it does a bit more. If the object pointer is not null, it checks to see if the object contains anything. E.g. it checks whether a string contains any characters. This sort of thing is often what you want and so it is convenient. But in the case of the DataFile class, it did something which I hadn't considered. The DataFile class implements the 'len' operator (the __len__ method), which allows the user to find out how many records are in the file by applying 'len' to the DataFile instance. And so the 'if objPtr' line triggered a call to the 'len' operator. But that triggered a reading of the file (since the data is only read on demand and then cached). So my check to see if the variable was of a class that needed a hardware connection was triggering a reading of the data file - which caused an exception due to my bad error checking. And this code was "so simple" that I didn't bother checking for exceptions there, causing the application crash. And of course this whole check was only done if the user ran the program when *not* connected to the hardware, since if it's already connected to the hardware there is no need to check if a connection is needed.
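
Here is a small sketch of that trap. LazyData is a hypothetical stand-in for the DataFile class (the real one reads a file; this one calls a loader function), but the truth-testing behaviour is standard Python: when a class defines __len__ and no __bool__, 'if obj' calls __len__.

```python
class LazyData:
    """Hypothetical stand-in for DataFile: loads on demand, then caches."""

    def __init__(self, loader):
        self._loader = loader   # callable standing in for the file read
        self._cache = None
        self.loads = 0          # how many times the "file" has been read

    def _records(self):
        if self._cache is None:
            self.loads += 1
            self._cache = self._loader()
        return self._cache

    def __len__(self):
        # Number of records; forces the data to be loaded.
        return len(self._records())


obj = LazyData(lambda: [1.0, 2.0, 3.0])
print(obj.loads)   # 0: nothing has been read yet

if obj:            # no __bool__ defined, so Python falls back to __len__
    pass
print(obj.loads)   # 1: the truth test read the "file"


def bad_loader():
    raise ValueError("bad data")


broken = LazyData(bad_loader)
try:
    if broken:     # an innocent-looking test that can now raise
        pass
except ValueError as exc:
    print("truth test raised:", exc)
```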

I fixed that problem by explicitly checking for a null object pointer (as I should have done before). Bug report done. Fixes checked into source code control. One last thing - check that the sample programs illustrating the use of the DataFile component are still working. Boom - first one fails right away. It turns out that my data caching code had forgotten to set the "good data" flag in one case. So the first time you asked for a data entry using the DataFile component, it read the file and then threw the data away. Only on the second time the file was read did the cached data stick around. And this egregious bug hadn't been noticed (for months!) since the data file was getting read inadvertently at program startup due to the 'if objPtr' bug I described above.
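
For the record, the explicit check looks something like the sketch below. All the names here (needs_hardware, Meter, Lazy, the namespace dict) are invented for illustration and are not the real tool's code. The point is that 'is not None' compares identity only and never calls __len__.

```python
class Lazy:
    """Hypothetical object whose __len__ is expensive; counts the calls."""

    def __init__(self):
        self.len_calls = 0

    def __len__(self):
        self.len_calls += 1
        return 0


class Meter:
    """Hypothetical component that talks to the hardware."""


def needs_hardware(names, namespace, hardware_types):
    """Return True if any named variable is a hardware component."""
    for name in names:
        obj = namespace.get(name)   # None if the lookup fails
        # Identity test only: does not invoke __len__ or __bool__.
        if obj is not None and isinstance(obj, hardware_types):
            return True
    return False


namespace = {"data": Lazy(), "meter": Meter()}

print(needs_hardware(["data", "missing"], namespace, (Meter,)))   # False
print(namespace["data"].len_calls)                                # 0
print(needs_hardware(["data", "meter"], namespace, (Meter,)))     # True
```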
