Friday, April 19, 2013

Finding a Firmware bug...

I spent the last few days hunting down a bug.
It was interesting enough that I thought I'd write a short blog post....

First some background...
In all of the NetBurner products we try very hard to make sure address 0 is unmapped,
i.e. any attempt to dereference a NULL pointer should cause an error.
The particular processor I was bug hunting on boots from address zero, so in most normal scenarios address 0 is mapped to some kind of memory.

In our case this is not so...after the boot monitor boots, it turns off the hardware memory mapping to address 0...and relocates all valid memory up higher in the memory map...
  (We are not using an MMU)

The bug we were hunting was an interrupt latency bug. 99.99% of the time the part has an interrupt latency of 0.9 usec or so... but every once in a while it would have a longer latency, randomly varying from 1 to 520 usec. 520 usec is WAAAAAY too long.....

So to instrument this I set up one of the on-chip timers to make a pulse and reset the counter at a fixed rate...
I hooked this timer output up to the non-maskable interrupt, and the first thing I did in the interrupt handler was read the timer counter value... this is an all on-chip, direct measurement of the interrupt latency...
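
The bookkeeping side of that measurement can be sketched in C roughly as follows. This is only an illustration, not the code from the actual test: the 125 MHz timer clock and the names latency_stats, latency_record and ticks_to_ns are all made up for the example, and the counter value is passed in as a parameter instead of being read from a real timer register so that the sketch compiles on any host.

```c
#include <stdint.h>

/* ASSUMPTION: timer counts at 125 MHz (8 ns per tick). Not from the post. */
#define TIMER_HZ 125000000u

/* Hypothetical record of the best and worst latency seen so far. */
typedef struct {
    uint32_t min_ticks;
    uint32_t max_ticks;
    uint32_t samples;
} latency_stats;

static void latency_init(latency_stats *s)
{
    s->min_ticks = UINT32_MAX;
    s->max_ticks = 0;
    s->samples = 0;
}

/* Meant to be called first thing in the interrupt handler.
 * The timer resets its counter when it emits the pulse, so the counter
 * value read in the handler is the latency in ticks. */
static void latency_record(latency_stats *s, uint32_t ticks_since_pulse)
{
    if (ticks_since_pulse < s->min_ticks) s->min_ticks = ticks_since_pulse;
    if (ticks_since_pulse > s->max_ticks) s->max_ticks = ticks_since_pulse;
    s->samples++;
}

/* Convert timer ticks to nanoseconds using a 64-bit intermediate. */
static uint32_t ticks_to_ns(uint32_t ticks)
{
    return (uint32_t)(((uint64_t)ticks * 1000000000u) / TIMER_HZ);
}
```

On real hardware the handler would read the timer's counter register and hand that value to latency_record; with the assumed clock, 113 ticks is about 0.9 usec and 65000 ticks is 520 usec.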

I soon found that it varied randomly from 0.9 to 520 usec.
Step 1 complete: we can replicate the problem.....

After this I tried a whole bunch of things to hunt this down: looked at code, moved stacks and vector tables to/from different classes of memory... nothing changed the result....

If I made the whole code set small enough that it all fit in the instruction cache, the problem went away....

As a random idea I changed the bus timeout monitor from very long to something shorter...
and the worst-case latency improved... this was the clue I needed to hunt down the bug....

Here is the bug:

We use a lot of C code in the system of the form:


         if ( pDHCPTick )
         {
             pDHCPTick();
         }

This code checks a function pointer to see if it's null and calls the function if it isn't...
This chunk of code is inside one of the system tasks and is used to service the DHCP client if it's active...
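
For readers who want the whole pattern in one place, here is a minimal self-contained sketch of such an optional hook. Only the name pDHCPTick and the fact that it is volatile and set from a different task come from this post; tick_fn, dhcp_tick, dhcp_start, dhcp_stop, system_task_poll and the counter are hypothetical names invented for the example.

```c
#include <stddef.h>

typedef void (*tick_fn)(void);

/* volatile because the pointer is written by a different task
 * than the one that calls through it. */
static tick_fn volatile pDHCPTick = NULL;

/* Hypothetical stand-in for the real DHCP service routine. */
static int dhcp_tick_count = 0;
static void dhcp_tick(void) { dhcp_tick_count++; }

/* Hypothetical enable/disable calls made by the DHCP client. */
static void dhcp_start(void) { pDHCPTick = dhcp_tick; }
static void dhcp_stop(void)  { pDHCPTick = NULL; }

/* Called from the system task loop. */
static void system_task_poll(void)
{
    /* Read the volatile pointer once so the value tested
     * is the same value that gets called. */
    tick_fn fn = pDHCPTick;
    if (fn)
    {
        fn();
    }
}
```

Copying the pointer into a local before testing it is a choice made for this sketch; it keeps the check and the call consistent if another task clears the hook in between.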


This compiles to  assembly code something like:

moveal  pDHCPTick,%a0
testl %a0
beqw  skip
jsr @%a0
skip:

So if pDHCPTick is null the jsr @%a0 will never be executed....
But the processor I'm having the issue with has aggressive pipelining....
So it tries to fetch the instruction at the jsr target (the address held in %a0) even when the branch skips the call.

This causes an instruction fetch of address 0....
If address 0 is not in the instruction cache then this forces a cache miss and a cache line read from physical address 0.....

Physical address 0 is unmapped, so nothing acknowledges the read and the bus timeout monitor goes off, terminating the transaction 520 usec later...
If the interrupt has the misfortune of trying to go off while we are hung in the bus timeout, we get long latency....
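
That also explains why the measured latency looked random. A deliberately simplified model, using the 0.9 usec and 520 usec figures from above, is sketched below. It assumes the interrupt cannot be taken until the stalled bus cycle ends and is otherwise serviced in the normal time; the function name and the nanosecond timestamps are made up for illustration.

```c
#include <stdint.h>

#define BASE_LATENCY_NS 900u     /* ~0.9 usec normal latency (from the post) */
#define BUS_TIMEOUT_NS  520000u  /* ~520 usec bus timeout (from the post) */

/* Simplified model (an assumption, not a measurement):
 * stall_start_ns is when the fetch of address 0 began,
 * irq_ns is when the interrupt was asserted.
 * Returns the modeled interrupt latency in nanoseconds. */
static uint32_t modeled_latency_ns(uint32_t stall_start_ns, uint32_t irq_ns)
{
    uint32_t stall_end_ns = stall_start_ns + BUS_TIMEOUT_NS;

    /* Interrupt arrives outside the stall window: normal latency. */
    if (irq_ns < stall_start_ns || irq_ns >= stall_end_ns)
    {
        return BASE_LATENCY_NS;
    }

    /* Interrupt arrives during the stall: it waits out the remainder,
     * but never does better than the normal latency. */
    uint32_t remaining_ns = stall_end_ns - irq_ns;
    return (remaining_ns > BASE_LATENCY_NS) ? remaining_ns : BASE_LATENCY_NS;
}
```

In this model an interrupt that lands right as the stall begins waits the full 520 usec, one that lands halfway through waits about 260 usec, and one that misses the window sees the usual 0.9 usec, which matches the spread seen on the timer.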

So the problem was solved by setting up the system so there is a valid bus ack at address 0....
That means we no longer catch NULL pointer accesses... but it fixes the interrupt latency...

We are working on a better solution... but I thought the problem crossed enough different hardware/software domains that it was interesting.

6 Comments:

Blogger heroineworshipper said...

Wonder if all pipelined CPUs have this lockup problem or if it's a Freescale feature.

7:01 PM  
Anonymous Andrew Platzer said...

Does declaring pDHCPTick as 'volatile' help?

10:07 AM  
Blogger Paul Breed said...

It is declared as volatile, as it's set in a different thread than it's used..... but note that volatile changes the way the compiler compiles things and has zero effect at the assembly language/machine instruction level... the problem is there, not in compiler optimization....

4:36 PM  
Blogger Andy said...

Which processor is this?

4:49 PM  
Blogger Paul Breed said...

Processor is Freescale ColdFire MCF5441x

1:25 PM  
Blogger Andy said...

Thanks!

4:51 PM  
