[NSWI004] A04 - no interrupt handling?
Vojtech Horky
horky at d3s.mff.cuni.cz
Mon Dec 7 16:38:47 CET 2020
Hello,
finally, a response to the mailing list, in case others are
interested in the root cause.
On 06. 12. 20 at 23:43, Petr Tlapa wrote:
> Hello,
> we happened to encounter weird behaviour when interrupting - the
> function handle_exception_general() is never called (we have just
> panic() in it to test it.) - there is just a freeze in a program.
> (yes we call timer_interrupt_after() and the status register is set
> to 0xff01 when creating a thread.)
>
> Moreover when run on MSIM version 1.4.0 everything works just as expected.
> Could the version 1.4.2 somehow be a cause of this or some not seen
> error in our code?
The actual cause of the problem was a stack overflow. Furthermore, it was
an overflow of the initial stack, even before the first real thread ran.
Because this stack is located just below the code (but above the exception
handlers), its overflow did not corrupt any C code, only the
exception handlers. Hence, the "freeze" was actually an endless loop
where the exception handler (i.e. stack contents treated as code) was causing
nested exceptions forever.
The reason for the stack overflow was that one of the printing functions was
recursive and needed a surprisingly large amount of memory for its local
variables.
The more interesting question is how to detect and debug this.
The first thing to do is to interrupt the execution and look at the CP0
registers. In this case, the "Cause" register was showing a TLB load
exception, and the status register had a value of 0xff03. That differs by
one bit from the magical 0xff01; the extra bit (consult the manual
for details) announces that we are in an exception handler. But since the
exception handler should panic() as the very first thing, this suggests
that the exception handler is broken. [Another option is to run msim -t and
notice how TLB exceptions are rethrown forever.]
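To make the "one bit" concrete, here is a minimal sketch of decoding the two
CP0 registers by hand. It assumes the usual R4000-style layout (IE in bit 0
and EXL in bit 1 of Status, ExcCode in bits 2-6 of Cause); the macro and
function names are made up for this illustration and are not part of the
assignment skeleton, so double-check the bit positions against the manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed R4000-style CP0 layout; names are illustrative only. */
#define STATUS_IE   0x00000001u  /* bit 0: interrupts globally enabled */
#define STATUS_EXL  0x00000002u  /* bit 1: exception level (inside a handler) */
#define STATUS_IM   0x0000ff00u  /* bits 8-15: interrupt mask */

#define CAUSE_EXCCODE_SHIFT 2
#define CAUSE_EXCCODE_MASK  0x1fu
#define EXCCODE_TLBL        2u   /* TLB exception on load or instruction fetch */

/* True when the CPU believes it is currently handling an exception. */
static bool status_in_exception(uint32_t status) {
    return (status & STATUS_EXL) != 0;
}

/* Extracts the exception code field from the Cause register value. */
static unsigned cause_exception_code(uint32_t cause) {
    return (cause >> CAUSE_EXCCODE_SHIFT) & CAUSE_EXCCODE_MASK;
}
```

With these helpers, status_in_exception(0xff01) is false while
status_in_exception(0xff03) is true, which is exactly the difference
described above.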
Whether you run msim -t or break the simulation, it is worth
running a few instructions to see where the code is and what is happening.
Since the handler is in C, you should see a normal prologue among the
instructions, starting with a jump (to the C routine). That was not the
case, hence we dumped the instructions of the exception handler via
dumpins 0x180 10
(as 0x180 is the hard-wired address of the exception handler). The code
should look something like this (see head.S):
0x000000180 j 0x630
0x000000184 nop
When we saw
0x000000180 daddiu a1, s3, 25699
0x000000184 nop
0x000000188 nop
0x00000018c sll 0, 0, 2
0x000000190 lb 0, 408(0)
0x000000194 lb 0, 2236(0)
instead, it became clear that the code had been overwritten.
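The same comparison can be done on the raw instruction words. The sketch
below only checks whether a word is a MIPS "j" instruction (opcode 2 in the
top six bits) and recovers the low 28 bits of its target. The helper names
are hypothetical, and the two example words in the note after the code were
encoded by hand from the listings above, not read from the simulator.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIPS_OPCODE_J 0x02u  /* primary opcode of the "j" instruction */

/* The primary opcode lives in the top 6 bits of every MIPS instruction. */
static unsigned mips_opcode(uint32_t insn) {
    return insn >> 26;
}

/* True when the word encodes an unconditional "j target". */
static bool is_jump(uint32_t insn) {
    return mips_opcode(insn) == MIPS_OPCODE_J;
}

/* Low 28 bits of the jump target: the 26-bit field shifted left by two. */
static uint32_t jump_target_low(uint32_t insn) {
    return (insn & 0x03ffffffu) << 2;
}
```

For the expected "j 0x630" (hand-encoded as 0x0800018c), is_jump() returns
true and jump_target_low() gives 0x630. For the "daddiu a1, s3, 25699" that
was actually there (hand-encoded as 0x66656463), is_jump() returns false.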
Then it was rather easy to determine the cause, as inserting a simple
printk("##### %pF\n", 0x80000180);
into kernel_main() after each subsystem initialization showed that the
stack overflows during heap initialization, which led us to the recursive
printk implementation in the end.
And how to prevent this? Here the situation was a bit more complicated
by the fact that it was the initial stack, which is not that clearly
visible in the code (see head.S for setting $sp).
Nevertheless, it is always possible to put a special value at the stack end
and check in thread_join whether the value is still there (as a matter of
fact, putting it near the end with extra space after it means that the
kernel may run without corrupting anything and you may still detect the
near-overflow).
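A minimal sketch of such a check follows. It assumes a downward-growing
stack described by a pointer to its lowest address; the magic value, the
size of the spare area and the function names are all arbitrary choices for
this illustration, not something the assignment skeleton provides.

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_CANARY      0xdeadbeefu  /* arbitrary magic value */
#define STACK_GUARD_WORDS 16           /* spare words kept below the canary */

/*
 * The stack grows downwards, so its "end" is the lowest address.
 * The canary is placed a few words above that address, which leaves
 * spare room: a small overflow destroys the canary but nothing else.
 */
static void stack_canary_install(uint32_t *stack_lowest) {
    stack_lowest[STACK_GUARD_WORDS] = STACK_CANARY;
}

/* Returns true while the canary still holds the magic value. */
static bool stack_canary_intact(const uint32_t *stack_lowest) {
    return stack_lowest[STACK_GUARD_WORDS] == STACK_CANARY;
}
```

The idea would be to call stack_canary_install() when the stack is set up
and stack_canary_intact() in thread_join (or more often) and panic() when it
returns false.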
Please note that whenever you want us to help you debug your code, it is
really necessary to state which commit and which test you are using (and
preferably to check it on the lab.d3s machine). [In this scenario, we had
another obstacle: on my machine, I had a slightly older GCC, and there the
stack smashing stopped 4 bytes earlier (different alignment somewhere, I
suppose) and the particular test passed.]
Cheers,
- VH