Debugger is not a naughty word

Computers are famed for harbouring bugs, and the high rate of failures in software compared to other industries is a constant cause of embarrassment. I'd like to explore why this is, with an example, and what we might be able to do about it.

Note: Although a lot of the details of the reminder crash are unfortunately very technical, I have done my best to explain things in a way that lay people should be able to make some sense of. However, some things would require a lot of background information, in which case I've just ploughed on without explanation. So if you come across something you don't understand, feel free to skip ahead a bit; you shouldn't lose too much.

A bug: The Reminder Crash

One of my recent jobs was writing a data storage server. I won't go into any more details of it here, other than to make the following pertinent points:

  1. There are multiple instances of the server running. Each server opens an on-disk database for writing, and makes changes to it when told to do so by clients.
  2. There are clients that read the data direct from the on-disk database, but ask the server instances to perform updates for them.
  3. Sometimes servers ask each other to send them blocks of records (namely, when recovering state after a crash or planned outage). This processing is known as 'reminding'.

A problem was reported to me: one of the servers in the system was not starting up properly. Instead, it would stick in the "recovering" state, meaning that it had started up, requested information it believed itself to be missing from another server, and was sitting waiting for it.

So, the first thing I did was to examine the system logs to find out which server was supposed to be reminding it, as a message is logged when a server requests reminding. I logged into the computer running that server, and found no remind operations in progress.

So I manually started a remind, and to my dismay, it quickly aborted with the dreaded error:

  Segmentation violation (core dumped)

Now, the structure of the reminder process is thus:

  1. Open the archive table. This is a database file listing when each record in the database was last modified, sorted on the modification time. It contains the modification time, and the name of the record.
  2. Create something called a 'cursor', a sort of movable pointer into the table, and position it at the first record modified after the time from which the server is requesting reminding.
  3. If a record name was found, then fetch the record with that name from the on-disk database (in the same way as any client would) and remind the asking server of it.
  4. Move the cursor on, to find the next record.
  5. If there is none, stop. If there is one, but its modification time is later than the last time the requesting server is interested in, stop. Otherwise, go back to step 3. (A rough sketch of this loop in code follows.)
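
For the curious, here's a minimal sketch of that loop against DB4's cursor API. Error handling is omitted, and the remind_record() helper and the key format are my inventions for illustration; the real code is rather more involved.

  #include <db.h>
  #include <stddef.h>
  #include <string.h>
  #include <time.h>

  /* Hypothetical helper: fetch the named record via the client library
   * (in the same way as any client would) and send it to the requesting
   * server. */
  void remind_record(const void *name, size_t len);

  void remind(DB *archive, time_t from, time_t until)
  {
      DBC *cursor;
      DBT key, data;
      time_t t = from;
      int ret;

      /* Step 2: create a cursor on the archive table and position it
       * at the first entry at or after the requested start time. */
      archive->cursor(archive, NULL, &cursor, 0);
      memset(&key, 0, sizeof key);
      memset(&data, 0, sizeof data);
      key.data = &t;
      key.size = sizeof t;
      ret = cursor->c_get(cursor, &key, &data, DB_SET_RANGE);

      while (ret == 0) {
          time_t modified;
          memcpy(&modified, key.data, sizeof modified);
          if (modified > until)
              break;                /* step 5: past the requested range */

          /* Step 3: remind the asking server of this record. */
          remind_record(data.data, data.size);

          /* Step 4: move the cursor on to the next archive entry. */
          ret = cursor->c_get(cursor, &key, &data, DB_NEXT);
      }

      cursor->c_close(cursor);
  }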

The 'core dumped' means that the operating system kindly recorded the exact state of the process at the time it failed to a file on disk for us. Opening this file up, I could tell that it had failed within step four - moving the cursor on to find the next record.
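
For those who haven't done this before, getting that far is just a matter of loading the core file into gdb alongside the program and asking for a backtrace; something like this, with the binary name invented:

  $ gdb server core      # load the program together with its core dump
  (gdb) bt               # print the call stack at the moment of the crash

The backtrace lists the chain of function calls that was in progress when the process died.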

That's worrying, since I used an existing database system called DB4 to manage the on-disk databases, including the archive table. If it was crashing within DB4, was there anything I could do about it?

I tried checking lots of things. Was the on-disk database corrupt? (No.) Did it always crash at the same point? (Yes: just after sending a record that announced the creation of a new table - a clue I did not pick up on at the time.)

Sadly, I didn't have debugging symbols for DB4, so while the core dump could be easily analysed to show what my code was doing, from the point my code went into DB4, all I could do was examine the assembly language program.

So after a day of barking up the wrong tree (during which I spotted and fixed another bug: the cursor wasn't being positioned correctly to begin with, so the reminder would often remind the server of a lot of records it already knew about), I started trying to figure out the assembly language. I had the DB4 source code in C, so I could see what it was trying to do, but without the debugging symbols, I had no way of relating the exact point of the crash, or the contents of memory within DB4, to the source code. I had to work it out by hand.

For those who understand such things, here's the disassembly of the function, with XXXXX marking the point at which it crashed:

  mov    %rbp,0xffffffffffffffe0(%rsp)
  mov    %r12,0xffffffffffffffe8(%rsp)
  mov    %rdi,%rbp                    
  mov    %r13,0xfffffffffffffff0(%rsp)
  mov    %r14,0xfffffffffffffff8(%rsp)
  mov    %rdx,%r13                    
  mov    %rbx,0xffffffffffffffd8(%rsp)
  sub    $0x38,%rsp                   
  mov    (%rdi),%rax                  
  mov    %rsi,%r14                    
  mov    %ecx,%r12d                   
  mov    0x28(%rax),%rbx              
  mov    0x160(%rbx),%rax             
  test   %rax,%rax                    
  je     0x2b2d134616c9 <__db_c_get_pp+73>
  mov    0x30(%rax),%rax      <- XXXXX
  mov    0x4(%rax),%r11d
  test   %r11d,%r11d
  jne    0x2b2d13461740 <__db_c_get_pp+192>

As you can see, that's x64 assembly language, and I'm not familiar with it. Sure, it's pretty similar to i386, which I'm happy with, just with 64-bit registers (and more of them). However, more registers usually means a different calling convention, and I had no idea what the initial value of rdi was expected to be - the register from which the invalid pointer in rax at the point of the crash ultimately comes.

So I did a bit of research, and found that rdi starts off holding the first word-sized argument to the function - in this case, the pointer to the cursor object!

So I looked at the DB4 source to see the structure of the cursor object, and cajoled gdb (the debugger) into telling me the offsets of the fields within the chain of structures in memory and compared them to the offsets used in the memory loads, until I had an interpretation of the relevant instructions:

  mov    %rdi,%rbp              rbp = DBCursor
  mov    (%rdi),%rax            rax = DBCursor->dbp
  mov    0x28(%rax),%rbx        rbx = DBCursor->dbp->dbenv
  mov    0x160(%rbx),%rax       rax = DBCursor->dbp->dbenv->reginfo
  test   %rax,%rax                        
  je     0x2b2d134616c9 <__db_c_get_pp+73> if rax == 0 go here
  mov    0x30(%rax),%rax         <- XXXXX

Sure enough, the culprit was DBCursor->dbp->dbenv->reginfo. It was meant to be a pointer, but the value it held didn't point at a valid memory region.
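
As an aside: when you do have headers that match the library, a quicker way to recover those offsets than cajoling gdb is a throwaway C program using offsetof. A sketch - the field names are the ones on the path above, but check them against your own db.h:

  #include <db.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Print the offsets of the fields the disassembly walks through,
   * for comparison with the displacements in the memory loads. */
  int main(void)
  {
      printf("DBC.dbp        = 0x%lx\n", (unsigned long) offsetof(DBC, dbp));
      printf("DB.dbenv       = 0x%lx\n", (unsigned long) offsetof(DB, dbenv));
      printf("DB_ENV.reginfo = 0x%lx\n", (unsigned long) offsetof(DB_ENV, reginfo));
      return 0;
  }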

With a bit of trial and error, running the program under the debugger so I could step through it an operation at a time, it became apparent that the value of reginfo was fine until the first part of step 3, "fetch the record with that name from the on-disk database (in the same way as any client would)". Sadly it only went wrong on around the six thousandth run through the cycle, so I had to arrange for the debugger to run it until just before that cycle, then pause the program and give me control to step through it while watching the state of DBCursor->dbp->dbenv->reginfo.

So I then restarted the program in the debugger and ran it again, stopping at the same point (so that reginfo was correct again), and stepped inside the operation that was causing the problem, causing its sub-operations to be run one by one rather than the whole thing at once. Reading the value of DBCursor->dbp->dbenv->reginfo at each step was now tricky, since getting a record from the database has nothing to do with the DB cursor that the remind process creates: record fetches mainly happen in normal client applications, rather than just in this specialised part of the server, so the expression DBCursor->dbp->dbenv->reginfo made no sense in that context. Before stepping inside, I therefore had to note down the in-memory address of DBCursor and keep an eye on it that way.
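
In gdb terms, the dance went roughly like this. The function name and address here are invented for illustration; 'ignore' tells gdb to skip over a breakpoint a given number of times:

  (gdb) break remind_next   # hypothetical name of the step 4 routine
  (gdb) ignore 1 5990       # don't stop for the first 5990 hits
  (gdb) run                 # ...so this stops on hit 5991
  (gdb) print cursor        # note the DBC's address; say it's 0x61f2c0
  (gdb) watch ((DBC *) 0x61f2c0)->dbp->dbenv->reginfo
  (gdb) continue            # gdb halts the moment reginfo changes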

The first thing the get operation does is to check it has the correct table files open. There can be lots of database table files, and the client library keeps them all open so it can get things from whatever table the user asks for. However, when a new table is created, the client doesn't immediately know, so before trying to perform the read, it first checks to see if any new tables have been created. If any have, it stops to make sure its list of open tables is correct before continuing.

At this point, I'd forgotten that the last update to succeed before the crash was a table creation, but I was quickly reminded of it when I found that this check was the operation that clobbered reginfo.

So I restarted the program again, ran it again, again noted down the memory address of DBCursor (it moves on each run of the program), and again stepped into the get operation, and then stepped into the check-for-new-tables operation.

Sure enough, it noted that a new table had been created, so it went into the routine that deals with this.

And the way this routine handled it was:

  1. Close all the tables
  2. Close down the DB4 database system itself (technically, the database environment)
  3. Reopen the DB4 database system (technically, open the database environment)
  4. Open all the tables

These were two already-existing operations: closing all the tables then the DB4 environment happens whenever a client application closes, and opening the DB4 environment then all the tables happens whenever a client application starts. Closing and reopening everything when a single table is created may seem inefficient, but table creations are rare events, and writing the routine this way (rather than actually checking to see which table has been newly created and opening that and adding it to the existing list) has two major benefits:

  1. It's quick, as I reuse the existing subroutines. Five minutes of my time rather than half an hour.
  2. It's safe, as I'm using existing code that's known to work, rather than writing something complex and new that will have new bugs.
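
In outline, the routine amounted to no more than this; the four helper names are hypothetical stand-ins for the existing client startup and shutdown routines:

  /* Hypothetical names for the four pre-existing routines: */
  void close_all_tables(void);
  void close_db_environment(void);
  void open_db_environment(void);
  void open_all_tables(void);

  /* Deal with newly created tables by closing and reopening everything. */
  static void handle_new_tables(void)
  {
      close_all_tables();
      close_db_environment();
      open_db_environment();
      open_all_tables();
  }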

And that routine was perfectly correct. It did its job flawlessly. And yet it was still corrupting reginfo.

The reason, of course, was that a DB cursor is not valid after the table it relates to is closed. Closing the database environment was also closing the archive table, even though the client library did not directly control it. And therefore my cursor was no longer valid.

The DB4 library didn't make the cursor invalid in a "nice" way, such as arranging that trying to use the cursor in future would cause a useful error message. And I'm not complaining - it was under no obligation to: the documentation clearly states that a cursor becomes invalid when the underlying table is closed, and that the results of trying to use such a cursor are undefined, so could be anything. Which most likely means a crash.
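
In miniature, the hazard looks like this - a hypothetical function given already-open handles, error handling omitted:

  #include <db.h>

  void hazard(DB *archive, DB_ENV *dbenv)
  {
      DBC *cursor;
      DBT key = {0}, data = {0};

      /* Open a cursor on the archive table... */
      archive->cursor(archive, NULL, &cursor, 0);

      /* ...close the environment, which also closes the table... */
      dbenv->close(dbenv, 0);

      /* ...and the cursor is now invalid: using it is undefined
       * behaviour, which in practice most likely means a crash. */
      cursor->c_get(cursor, &key, &data, DB_NEXT);
  }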

The root cause of the problem was one of assumptions. When I first wrote the reminder, it worked perfectly well.

Later, we ran into a problem: a client that created a table then tried to use it would fail. This was actually allowed by the specification I was writing the system to - table creations are allowed to take some time to happen after they are requested. However, it was a pain, and since I was being paid for my time rather than a fixed rate for completion of the spec, I was happy to accommodate requested changes; so I added the check for newly created tables.

And since the client library, in a normal client, has sole control over the DB4 databases, I knew that I could safely close down and reopen every database table if I wanted to, and that would be the easiest and most reliable way to open the new tables ready for use.

At the time, busy with the client library and its use in normal clients, I forgot that the reminder process used the client library to do get operations, while also itself directly accessing the archive table outside of the client library - using the same DB4 environment.

And so the change I made in the client library broke the reminder process, but only in a way that would be noticed if a table was created during a reminder operation (such as if the reminder operation reminds of the creation of a table...). And so when I tested my changes, I didn't think to go and test the reminder as well. And anyway, the reminder still worked fine. Until we happened to ask it to remind us of a table creation.

Having spent two days finding that out, I spent an hour separating the 'close all tables' part out from the 'close all tables and the DB4 environment' routine, and likewise for the 'open the DB4 environment then all the tables' routine, and then rewrote my 'deal with newly created tables' routine to just call the 'close all tables' and 'open all tables' routines, thus leaving the DB4 environment (and the archive table, and the cursor open on it) intact. Then I tested the result, in case my new code had any new bugs in it.
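
After that separation, the rewritten routine was trivial - in the same hypothetical terms as the sketch above:

  /* Deal with newly created tables without touching the DB4 environment,
   * so cursors on the archive table (like the reminder's) stay valid. */
  static void handle_new_tables(void)
  {
      close_all_tables();
      open_all_tables();
  }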

Job done.

So should I be happy, or sad?

I was elated at having managed to track the bug down. It was a big intellectual challenge, and I had to use a wide range of skills and knowledge to find it, which was fun. OK, it was horrible at the time, bashing my head against things as I went up blind alleys, but looking back upon it as an accomplished success is nice.

But, it was my fault the bug had appeared in the first place. Should I not be sad that I didn't realise that my change to the client library would have an unpleasant consequence? Shouldn't I have tested things more thoroughly and found the bug there and then?

Well, yes and no, a bit of both.

The reception I've had from those affected by the bug or otherwise involved in the project has been "Well done! Wow! What a clever person you must be to have found the bug and come up with a solution!", despite it being my fault the bug appeared in the first place. The general feeling in the industry is that bugs are inevitable, and that no human could be expected to write bug-free software the first time - well, not without taking a very long time over it, proving the correctness of everything with formal methods and writing lots and lots of detailed unit tests.

But it would be alarming if the designers of bridges, for example, were to say things like "Oh, I'd never imagined that if the wind was blowing above thirty miles per hour from due north-north-west, while the air temperature was exactly fifteen degrees and the sun was exactly overhead, concrete turned into custard - but now I think about it, it makes perfect sense!".


7 Comments

  • By @ndy Macolleague, Tue 15th Jan 2008 @ 2:02 pm

    Hi,

    Here is a quick "highest number we've seen all numbers less than or equal" algorithm that I just cooked up whilst reading this. It's pretty much at the pseudo code level and I've not even tried to compile it but I think it illustrates a point.

    I don't store the numbers that come in: I just store a set of flags so that we know if we have seen a given number or not.

    int main (void) { int i; /* Most recent value read */ int blk, idx; /* The block that i lives in and the index into block */ int mem[1024]; /* Some space. Highest value is 1024 * sizeof(int) * 8. */ int h = 0; /* The block that we are up to */ int val; /* The answer so far */

        while (h < 1024) {
                i = read();
                blk = i / (sizeof(i) * 8);
                idx = i % (sizeof(i) * 8);
    
                mem[blk] |= (1 << idx);
    
                for ( ; h<1024, mem[h] != 0xFFFF; h++);
    
                val = ((h-1) * (sizeof(int) * 8)) + pop_count(mem[h]);
    
                /* pop_count is number of bits that are 1 */
    
                printf("Highest with all less than or equal: %d.\n", val);
        }
    

    }

    Seeing as we are tracking h, we could probably throw away / reuse the early parts of mem as they fill up. That would mean that mem would only have to be big enough to store the range of values that might be undecided at any given time. To do that would make the illustration above less clear, so I didn't bother, as it'd be mostly memory management rather than part of the core algorithm.

  • By @ndy Macolleague, Tue 15th Jan 2008 @ 2:04 pm

    Hmph! The formatting of the variable declarations has been messed up. Here it is again:

    int main (void) {

        int i; /* Most recent value read */
    
        int blk, idx; /* The block that i lives in and the index into block */
    
        int mem[1024]; /* Some space. Highest value is 1024 * sizeof(int). */
    
        int h = 0; /* The block that we are up to */
    
        int val; /* The answer so far */
    
    
        while (h < 1024) {
                i = read();
                blk = i / (sizeof(i) * 8);
                idx = i % (sizeof(i) * 8);
    
                mem[blk] |= (1 << idx);
    
                for ( ; h<1024, mem[h] != 0xFFFF; h++);
    
                val = ((h-1) * (sizeof(i) * 8)) + pop_count(mem[h]);
    
                /* pop_count is number of bits that are 1 */
    
                printf("Highest with all less than or equal: %d.\n", val);
        }
    

    }

  • By Gavan Fantom, Tue 15th Jan 2008 @ 2:12 pm

    I would have expected that for nearly-sequential input, storing ranges (ie extents) would have been a reasonable compression. The answer is then the size (well, strictly speaking, the higher end) of the first extent.

    The size of the extent map would then be a reasonable metric for the extent of disorder in the input.

    Depending on the quality of the input, an extent map of all numbers not seen so far may also be a suitable choice.

  • By @ndy Macolleague, Tue 15th Jan 2008 @ 2:19 pm

    Hi,

    This thing probably has loads of bugs... The for loop will prevent the while loop terminating. So change it to "while (val < (1024 * sizeof(int) * 8))" or just change the memory management model and make it "while (1)".

  • By @ndy Macolleague, Tue 15th Jan 2008 @ 2:27 pm

    Well... the for loop should have prevented the while loop from terminating as it should be h<1023 not h<1024 otherwise we end up with a bounds overflow on the next line. Perhaps I should have compiled this... Maybe I still should...

    Oh dear.

  • By Gavan Fantom, Tue 15th Jan 2008 @ 2:36 pm

    In my experience, the biggest factor in improving quality in software is visibility. You touched on a few of the "rules" which have been proposed over time, but the overriding rule as far as I'm concerned is "Write clear and readable code". If you can read it and understand it, you have more chance of spotting when something is not right. This usually extends to having a clear write-up or diagram describing the design clearly so that someone trying to understand the code can understand the design at the same time.

    But that's not the be all and end all of visibility. Unit tests can help here, but only if you actually run them.

    Amazingly, large codebases with multiple developers often suffer from the problem of the head of the repository not even compiling. When everybody is updating all the time it gets noticed and fixed regularly, but when you have people doing the bulk of their development on a snapshot or a branch and only updating to the head or the trunk occasionally, this can become a really big problem. Again the way to solve this is by greater visibility - in this case autobuilding.

    Another frequent failure is to spot a bug (or a potential bug), not fix it straight away, and then forget about it. Two years later it rears its ugly head, and by that time you've completely forgotten about it and have to debug from scratch. It is only after spending weeks debugging that you realise that you've already seen this, and then you wish you'd fixed it at the time, or at least written a TODO item, or tracked it in a bug database.

    And if it's in a bug database, it's visible. It's measurable. You can generate a report of all known bugs.

    So once you have readable code, documented design, (some) unit tests which you run daily right after your nightly builds, and a bug database full of all the bugs which you spotted and didn't have time to fix, you still have bugs that you don't know about. What about them?

    Well, software is not fault-tolerant except in extremely specific cases. So if a fault does occur, make sure your code spots it early and fails noisily. This will increase visibility of the problem. It will also prevent knock-on failures, as the code will simply stop executing once bad data has been detected. And also, hopefully, it will make it easier to isolate where the problem occurred. This is all extremely good news when it comes to debugging. A failed assertion is much more useful than a Segmentation violation.

    The more you can see, and the clearer it is, the higher the quality of the resulting software, given a competent and conscientious programmer.

  • By alaric, Thu 17th Jan 2008 @ 10:29 am

    @ndy: That's similar to the principle I used, I guess. Since the algorithm runs continuously with an ever-increasing (and, eventually, wrapping back to 0) sequence of numbers (yes, before people ask, this is for packet sequence numbering in a network client!), I indeed have a buffer, but set up so that it slides along the space of sequence numbers, so it only stores the "frothy" region; the "solid" region where we have seen all the sequence numbers gets pushed off of the bottom of the buffer ASAP to free up space for froth.
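
    A sketch of what I mean, with the window size and names purely illustrative - a bitmap that covers only the froth, sliding its base forward as the bottom solidifies:

        #include <stdint.h>
        #include <string.h>

        #define WINDOW 1024               /* bits of froth we can track */

        static uint8_t seen[WINDOW / 8];  /* bitmap over [base, base+WINDOW) */
        static uint32_t base;             /* every number below base is seen */

        void mark_seen(uint32_t seq)
        {
            uint32_t off = seq - base;    /* unsigned maths handles wrap */
            if (off >= WINDOW)
                return;                   /* outside the window */
            seen[off / 8] |= 1 << (off % 8);

            /* Slide the window over any solid prefix of whole bytes. */
            while (seen[0] == 0xFF) {
                memmove(seen, seen + 1, sizeof seen - 1);
                seen[sizeof seen - 1] = 0;
                base += 8;
            }
        }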

    Gavan: Yep, good points, thanks


Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales