[NSWI004] A04 - no interrupt handling?

Mon Dec 7 16:38:47 CET 2020

Hello,

here is, finally, a response to the mailing list in case others are 
interested in the root cause.

Dne 06. 12. 20 v 23:43 Petr Tlapa napsal(a):
> Hello,
> we happened to encounter weird behaviour when interrupting - the 
> function handle_exception_general() is never called (we have just 
> panic() in it to test it.) - there is just a freeze in a program.
> (yes we call timer_interrupt_after() and the status register is set 
> to 0xff01 when creating a thread.)
> 
> Moreover when run on MSIM version 1.4.0 everything works just as expected.
> Could the version 1.4.2 somehow be a cause of this or some not seen 
> error in our code?

The actual cause of the problem was a stack overflow. Furthermore, it was 
an overflow of the initial stack, even before the first real thread ran.

Because this stack is located just below the code (but above the 
exception handlers), its overflow did not corrupt any C code, only the 
exception handlers. Hence, the "freeze" was actually an endless loop in 
which the exception handler (a.k.a. stack contents treated as code) kept 
causing nested exceptions forever.

The reason for the stack overflow was that one of the printing functions 
was recursive and needed a surprisingly large amount of memory for its 
local variables.

The more interesting question is how to detect and debug this.

The first thing to do is to interrupt the execution and look at the CP0 
registers. In this case, the "Cause" register was showing a TLB load 
exception, and the status register had a value of 0xff03. That differs 
by one bit from the magical 0xff01: the extra bit (EXL; consult the 
manual for details) announces that we are in an exception handler. But 
since the exception handler should panic() as the very first thing, 
being stuck there suggests that the exception handler is broken. [The 
other option is to run msim -t and notice how TLB exceptions are 
rethrown forever.]
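
To make the register reading above concrete, here is a minimal sketch of 
decoding the two values. The macro and function names are made up for 
this example (they are not taken from the course skeleton); the bit 
positions follow the standard MIPS CP0 layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the MIPS CP0 layout. */
#define CP0_STATUS_IE   0x00000001u  /* global interrupt enable */
#define CP0_STATUS_EXL  0x00000002u  /* set while an exception is being handled */
#define CP0_STATUS_IM   0x0000ff00u  /* interrupt mask, one bit per line */

#define CP0_CAUSE_EXCCODE_SHIFT  2
#define CP0_CAUSE_EXCCODE_MASK   0x1fu
#define EXCCODE_INT   0u  /* interrupt */
#define EXCCODE_TLBL  2u  /* TLB exception on load or instruction fetch */

/* True if the Status value says the CPU is inside an exception handler. */
static bool status_in_exception(uint32_t status) {
    return (status & CP0_STATUS_EXL) != 0;
}

/* Extract the exception code field from a Cause value. */
static unsigned cause_exccode(uint32_t cause) {
    return (cause >> CP0_CAUSE_EXCCODE_SHIFT) & CP0_CAUSE_EXCCODE_MASK;
}
```

With these helpers, 0xff01 decodes as "all interrupts unmasked, 
interrupts enabled", while 0xff03 additionally has EXL set.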

Whether you run msim -t or break into the simulation, it is worth 
stepping through a few instructions to see where the code is and what is 
happening. Since the handler is in C, you should see a normal prologue 
among the instructions, starting with a jump (to the C routine). That 
was not the case, hence we dumped the instructions of the exception 
handler via

dumpins 0x180 10

(as 0x180 is the hard-wired address of the exception handler). The code 
should look something like this (see head.S):

   0x000000180    j 0x630
   0x000000184    nop

When we saw

   0x000000180    daddiu a1, s3, 25699
   0x000000184    nop
   0x000000188    nop
   0x00000018c    sll 0, 0, 2
   0x000000190    lb 0, 408(0)
   0x000000194    lb 0, 2236(0)

instead, it became clear that the code was overwritten.
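
The same comparison can be done programmatically. The following is only 
a sketch with made-up helper names: it tests whether an instruction word 
is a MIPS "j" (opcode 2 in the top six bits) and computes where it 
jumps. It assumes you obtain the word yourself, e.g. by reading it from 
the handler address.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIPS_OPCODE_SHIFT  26
#define MIPS_OPCODE_J      0x02u

/* True if the instruction word encodes an unconditional "j target". */
static bool mips_is_jump(uint32_t insn) {
    return (insn >> MIPS_OPCODE_SHIFT) == MIPS_OPCODE_J;
}

/*
 * Absolute target of a "j" located at address pc: the low 26 bits of
 * the instruction are shifted left by 2, the top 4 bits come from the
 * address of the following instruction.
 */
static uint32_t mips_jump_target(uint32_t insn, uint32_t pc) {
    return ((pc + 4) & 0xf0000000u) | ((insn & 0x03ffffffu) << 2);
}
```

An intact handler entry would pass mips_is_jump(), while a word such as 
the daddiu above would not.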

Then it was rather easy to determine the cause, as inserting a simple

printk("##### %pF\n", 0x80000180);

into kernel_main() after each subsystem initialization showed that the 
stack overflowed during heap initialization, which eventually led to the 
recursive printk implementation.

And how to prevent this? Here the situation was a bit more complicated 
by the fact that it was the initial stack, which is not that clearly 
visible in the code (see head.S for setting $sp).

Nevertheless, it is always possible to put a special value at the end of 
the stack and check in thread_join whether the value is still there (as 
a matter of fact, putting it near the end, with some extra space after 
it, means that the kernel may keep running without corrupting anything 
and you can still detect the near-overflow).
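
A minimal sketch of such a check follows. All names and constants are 
invented for this example, and it assumes the stack is a plain array of 
words that grows downwards, so its "end" is the lowest address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_CANARY_VALUE  0xdeadbeefu  /* arbitrary magic value */
#define STACK_GUARD_WORDS   16           /* slack kept below the canary */

/*
 * stack_lowest points to the lowest address of the stack area. The
 * canary is not placed at the very end but STACK_GUARD_WORDS above it,
 * so a slight overrun destroys the canary without leaving the area.
 */
static void stack_canary_install(uint32_t *stack_lowest) {
    stack_lowest[STACK_GUARD_WORDS] = STACK_CANARY_VALUE;
}

/* True if the canary still holds the magic value. */
static bool stack_canary_intact(const uint32_t *stack_lowest) {
    return stack_lowest[STACK_GUARD_WORDS] == STACK_CANARY_VALUE;
}
```

One would call stack_canary_install() when the stack is allocated and 
stack_canary_intact() at a convenient checkpoint such as thread_join. 
A thread that happens to write exactly the magic value, or that skips 
over the canary word, goes undetected, so this is a cheap heuristic and 
not a guarantee.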

Please note that whenever you want us to help you debug your code, it is 
really necessary to state which commit and which test you are using (and 
preferably check it on the lab.d3s machine). [In this scenario, we had 
another obstacle: on my machine, I had a slightly older GCC, and there 
the stack smashing stopped 4 bytes earlier (different alignment 
somewhere, I suppose) and the particular test passed.]

Cheers,
- VH