My Program Crashes When Sonya Logs in

Abstract

This article outlines a difficult low-level bug and the path to the solution.

Project

The project was DBCalc; see Virtual Memory in 64K.

Challenge

My DBCalc applications were sometimes crashing at what appeared to be random times. This should not have been possible: the OS we were using, RSX-11M, was a multi-user, time-sharing system that gave each application its own memory, completely isolated from other apps and from the system. I did not use a timer or anything else that would change from one execution to the next. I was used to totally deterministic execution of programs.

Debugging

First of all, of course, I tried to use the debugger. Disappointingly, DBCalc never crashed when I ran it in the debugger, so the debugger was useless to me.

When the program crashed, it printed the contents of the processor's 8 registers, one of which, R7, was the program counter. Every time DBCalc crashed, it would show different contents of the registers, including R7. I would copy the numbers to a piece of paper, use the detailed assembly log to find the code that was being executed, and try to work out what could be wrong with it. This took lots of time.
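The paper-and-pencil lookup described above, from a crash-time program counter to a line of the assembly log, amounts to a search over sorted addresses. Here is a minimal sketch of that lookup in Python. The log excerpt, its addresses, and the name instruction_at are all made up for illustration; they are not taken from DBCalc.

```python
import bisect

# Hypothetical excerpt of an assembly log: (start address, source text).
# Addresses are octal, as on PDP-11 listings; the contents are invented.
LISTING = [
    (0o1000, "MOV  #BUF, R0"),   # two words: 1000-1003
    (0o1004, "MOV  R1, R4"),     # one word:  1004-1005
    (0o1006, "ADD  R4, R0"),     # one word:  1006-1007
    (0o1010, "JSR  PC, SUBR"),   # two words: 1010-1013
]
ADDRESSES = [addr for addr, _ in LISTING]


def instruction_at(pc):
    """Return the (address, source) entry whose start address is the
    largest one not exceeding pc."""
    index = bisect.bisect_right(ADDRESSES, pc) - 1
    if index < 0:
        raise ValueError(f"PC {pc:o} is below the first listed address")
    return LISTING[index]


# A PC value copied from a (hypothetical) register dump, as an octal string.
addr, source = instruction_at(int("001005", 8))
print(f"{addr:06o}  {source}")
```

The sketch only finds the nearest preceding entry. It does not know where the listed code ends, so a PC far past the last address still maps to the last entry.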

After many sessions, a commonality emerged between the crashes. Somehow one global variable, $DSW, would get corrupted. This variable was the entry point into DBCalc's entire runtime, so if $DSW was corrupted, nothing at all would work. The reason I used $DSW was that it had a known fixed memory address. The actual address of $DSW was 14. $DSW was described in DEC's documentation like this:

   $DSW - reserved. 

And not another word about it. I felt that I was free to use it for my own purposes. I searched through my code and confirmed that, once initialized, the variable was never changed. I put print statements here and there and printed the contents of $DSW. Counter to my expectations, it did in fact change! But it changed at different times, and I still did not know what piece of code was changing it. I read and reread all suspicious parts of the system - nothing.

Next, I remembered that the PDP-11 processor has a control bit, T, which, when set to 1, stops the program after every instruction and invokes an interrupt routine. At this point I had spent several days chasing this bug, and even though I knew this was going to be slow, I decided to try setting the bit to 1. My interrupt routine would check the $DSW variable and print the program status whenever the variable changed.
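The same trick, checking a watched value after every step and reporting the step that changed it, can be tried in a modern setting. The sketch below is a rough Python analog, not the original PDP-11 routine: it uses sys.settrace, which calls a hook before each source line, in place of the T bit. The names memory, make_tracer, workload and run_watched are hypothetical.

```python
import sys

# Hypothetical stand-in for the watched memory cell ($DSW in the article).
memory = {"DSW": 0o1234}
changes = []  # (function name, line that ran last, old value, new value)


def make_tracer(watched_key):
    last_value = memory[watched_key]
    last_line = None

    def tracer(frame, event, arg):
        nonlocal last_value, last_line
        # "line" fires before a line runs and "return" fires when a
        # function exits, so a change seen here was made by the line
        # recorded in last_line.
        if event in ("line", "return"):
            current = memory[watched_key]
            if current != last_value:
                changes.append(
                    (frame.f_code.co_name, last_line, last_value, current)
                )
                last_value = current
        if event == "line":
            last_line = frame.f_lineno
        return tracer  # keep tracing inside this frame

    return tracer


def workload():
    total = 0
    for i in range(3):
        total += i
    memory["DSW"] = 0  # the unexpected write we want to catch
    return total


def run_watched(func, watched_key="DSW"):
    sys.settrace(make_tracer(watched_key))
    try:
        return func()
    finally:
        sys.settrace(None)  # always switch tracing off again


result = run_watched(workload)
for name, line, old, new in changes:
    print(f"{name}: line {line} changed the value from {old:o} to {new:o}")
```

As with the T bit, the price is speed: the hook runs for every executed line, so the traced program is much slower than normal. Note also that a hook like this can only blame the step that ran last, which is exactly why a change made from outside the program shows up against an innocent instruction.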

I made the necessary changes and was ready by 7pm that day. I launched the app. It ran for several hours and completed successfully! The next step was to start banging my head against hard surfaces.

The next morning, as I was out of ideas, I came in early and launched it again, unchanged. It was running normally until about 9am. At that point people started arriving and my colleague, Sonya, logged into the system and started a text editor. Sonya's presence must have somehow affected the ghost in the machine, because $DSW in DBCalc changed and my interrupt program printed the app status! Yessss! I stopped the program and found the line on which it changed. The line read:

   MOV R1, R4

The app was copying an integer from one register to another! An operation like this cannot, even in theory, change any memory cells. I turned off my terminal and went out for a walk. I walked and walked in the snow, trying to find an explanation for the impossible. And suddenly I stopped. The idea that stopped me was this: my program could have gotten swapped out of memory when Sonya launched the text editor. I rushed back to the office and started reading all the documentation I could find about the swapping mechanism in RSX-11M. Before long, I ran into this paragraph:

   When swapping a process out of RAM, the system uses the first 16 bytes of the process image to store the process context.

So much for "completely isolated from the system"! $DSW, whose address was 14, would change unpredictably whenever DBCalc was swapped out of memory!
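The effect of that paragraph can be simulated in a few lines. The following is a toy model in Python, assuming only what the quoted sentence says: the first 16 bytes of the image get overwritten during a swap. The image size, the context bytes and all function names are invented for the illustration and say nothing about how RSX-11M actually laid out the saved context.

```python
CONTEXT_BYTES = 16  # size of the area named in the quoted paragraph


def read_word(image, addr):
    """Read a 16-bit little-endian word, as the PDP-11 stores it."""
    return int.from_bytes(image[addr:addr + 2], "little")


def write_word(image, addr, value):
    """Write a 16-bit little-endian word at addr."""
    image[addr:addr + 2] = value.to_bytes(2, "little")


def swap_out_and_in(image, context):
    """Toy swap: the system stores its context over the start of the image."""
    if len(context) != CONTEXT_BYTES:
        raise ValueError("context must be exactly 16 bytes")
    image[:CONTEXT_BYTES] = context


image = bytearray(64)           # made-up, tiny process image
write_word(image, 14, 0o1234)   # a value kept at address 14, like $DSW
write_word(image, 16, 0o1234)   # the same value just past the context area

swap_out_and_in(image, bytes(range(0xA0, 0xB0)))  # arbitrary context bytes

print("address 14 intact:", read_word(image, 14) == 0o1234)
print("address 16 intact:", read_word(image, 16) == 0o1234)
```

In this model the word at address 14 lies inside bytes 0 to 15 and is lost, while the word at address 16 is the first one the swap leaves alone.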

I switched from using $DSW to some other location with an address higher than 16 and that was it.

RTFM!

Tangent

There is a lot of cursing going on in the programmers' vernacular. And somehow it's completely acceptable to use these expressions in everyday discourse and even in writing. Here are some egregious examples:

KISS Principle
"Keep it simple, stupid!", an expression used to guide software design in the direction of simplicity.
RTFM
Read the fucking manual!
BFS
Brain Fuck Scheduler. This is an alternative process scheduler for the Linux kernel. It was briefly included in Android.

If you can think of more of these, please comment.
