One question that Junior Developers frequently ask me is "How do you debug?".
I have a reasonably good track record on debugging problems, quite often on systems where I have zero prior knowledge. The general rules I follow for debugging other people's programs are:
Ask what is supposed to be happening, and what is actually happening.
Require proof that every step is actually happening.
In the course of this, almost all of the time we'll find one of the following results:
- They weren't running the code they thought they were (usually failure to do a clean build)
- Twenty seconds into the explanation, they will realise the problem, quite possibly without me having a clue how anything hangs together (Rubber Duck Debugging).
- Their supposition about what should happen is at odds with what the code actually says, and therefore with what it does. I call this "The compiler is broken! No it isn't!" This is really a variant on Rubber Duck Debugging, except that it needs me to point out, for example, that they have a symbol mismatch, or a stray semi-colon.
- There is a compiler warning announcing precisely what the problem is, which is being ignored.
- There is a compiler warning that has been turned off and some point in the project's history, which would have reported this issue.
- This is a known problem on iOS X.xx that we have run into on other projects. As a company, we're getting better at both disseminating information about this kind of thing via sprint kick-offs, and also by teaching our engineers to spot when something smells.
Once we get past that stage, I then follow this approach:
Is there any example of this working, anywhere? If so, what's different about my code?
If it all makes no sense at all, step away from the problem for a bit. Make a cup of tea, or even better, go do some exercise.
I recently had a thorny issue on some BLE Firmware where I failed to follow my own rules, and demonstrated why they work! We were working on Nordic's new nRF52832 chipset, using both Nordic nRF52 dev board, and Rigado's BMD-300 Evaluation Board.
Nordic and Rigado dev boards
For some reason the firmware I had written was drawing far far too much current in System-On Sleep mode, when it should have been essentially unobservable on the scope I was using.
Excess current draw - CurB is the sleep current, which should be essentially zero.
Starting to debug
Applying my third rule, I rapidly established that a simple cut-down version of my program which broadcast adverts but nothing else did not draw excess current.
There were two clear possibilities. Either it was actually leaking huge amounts of current when asleep, or it wasn't actually going to sleep. Either case was triggered by something in my program.
I immediately had a few suspicions:
- The chipset, dev board, and SDK were relatively new. Was there any Product Anomaly that applied to my board? I identified the precise variant of the chipset I was using, and found nothing suspicious in the documentation for the chipset or for the Nordic dev board. I also checked the SoftDevice release notes.
- I was using the SAADC from a radio notification callback. Was I perhaps running into some edge case that the manufacturer hadn't tested? That's a bit like claiming the compiler is broken. Possible, but not probable.
- I had missed something in the documentation about how to put the device to sleep. This seemed to be the most likely case, but I did lots of searching in the soft device documentation, and the forums, and didn't find anything.
I had a look in the SDK examples, and found a program that used the SAADC. It worked fine. I extended it to use the radio notification callback. It worked fine. I cut-down my program and found the function that was causing the problem, narrowing down the issue to a particular line of code (more on that later) which seemed fairly innocuous - it just wrote to a block of memory outside the stack.
I had, by this stage, unwittingly broken my first and second rules of debugging. And once the second rule goes wrong, everything else is suspect. I then wasted some time trying to establish why writing to memory should cause a current leak.
The section of code that confused me
When I started debugging the problem, I forgot to turn off optimisations in the compiler. That meant the code I was looking at was not necessarily the code that was being run, which broke my First Rule. In particular, some floating point operations I had executed a few lines higher were only actually compiled in if the result was used.
I then broke my Rules because I decided not to ask anyone for help, on the grounds that this problem related to a development environment that no one else in the office knew anything about. While this was true, that only meant that most likely no one could help me fix the problem. It did not mean that no one could help me diagnose the problem by spending five minutes sanity checking my own diagnosis process. If I had even then proved to myself that the code I was looking at was being run, I would have been okay. But I didn't. However, I was starting to not trust the diagnosis any more.
However, I did get one thing right. Instead of going back to the office after dinner to work on the problem (largely, I admit, because I couldn't find my office keys, and needed to use the oscilliscope) I decided to clean up my test cases a bit, and then got an early night (Rule 4). The next morning, working with a fresh mind (this is a partially valid application of rule 1!) I realised my error in not disabling optimisations (Rule 2), and rapidly found the problem.
The eventual problem turned out to be complex - but once I had the diagnosis, finding a fix was actually quite quick. I turned off optimisations, and discovered that the offending piece of code was actually the floating point instructions. A swift search for "nRF52 FPU sleep" in the Nordic dev forums brought up a related issue as the first link. It wasn't initially clear to me how this related to my problem, but the important thing was the fix provided worked.
The actual problem - feel free to skip
After another half hour of investigation, I realised that my case was actually the same as that discussed in a forum query. There, the divide by zero triggered the DZC bit in the FPU Status Control Register.. In my case, a conversion from an integer to a float triggered the IXC bit which indicates an Inexact Floating Point Exception.
Both exceptions triggered an FPU IRQ that needed to be cleared before the CPU would sleep, therefore explaining my current leak. A spot more googling showed that the issue was actually documented by Nordic - only in the SDK documentation rather than the soft device documentation because it wasn't actually a bug, it was merely the documented way that things worked.
The final results
The new oscilliscope trace - note that CurB (sleep current) is now zero, as closely as I can measure.
So what do we conclude? My rules would have worked - if only I had applied them properly!