[NSWI004] A02 - gitlab tests failing

Wed Nov 4 18:57:08 CET 2020

Hi,

it took us a while to reproduce the problem, but we finally managed to build a
failing kernel image outside the gitlab environment.

It turns out that the difference between the lab machine and the gitlab runner
is the path leading to the repository with your code.

The compiler stores absolute path names (for debug prints) in the image,
therefore changes in the path result in changes in the size of the resulting
kernel image.
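
To illustrate the mechanism, here is a minimal sketch. It assumes the debug
prints embed the __FILE__ macro, which expands to the path as it was passed to
the compiler. The helper debug_path_bytes() is hypothetical and not taken from
your kernel.

```c
#include <stddef.h>

/*
 * __FILE__ expands to a string literal holding the path of this source file
 * as it was given to the compiler. If the build passes absolute paths, the
 * literal grows and shrinks with the location of the repository.
 */
static const char debug_source_path[] = __FILE__;

/* Bytes this one path occupies in the image, including the trailing NUL. */
size_t debug_path_bytes(void) {
    return sizeof(debug_source_path);
}
```

Every source file that uses such a macro contributes one string like this, so
a longer repository path makes the image larger.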

When compiled by the same compiler (which should be the case for gitlab and
the lab machine), these are the only differences between a passing and a
failing image (along with different offsets in various instructions that
reference the strings in the image).

The failing kernel image in gitlab has 10416 bytes and we could not reproduce
the problem until the locally-built image had exactly 10416 bytes, which we
achieved by varying the path leading to the repository root (the length of the
absolute path was 68 characters, and it was a realpath, i.e., with symlinks
resolved).

Images of size 10400, 10432, 10448, or 10480 bytes did not fail (we could not
create an image of size 10464 bytes).

When your kernel fails, the interesting bits in registers are the following:

CPU registers:

ra ffffffff80001ad4   pc ffffffff80000180

CP0 registers:

08 BadVAddr  00000000

0d Cause     00000008  res1: 0 exccode: 02 (TLB Exception (Load))
                       res2: 0 ip: 00 res3: 00 ce: 0 res4: 0 bd: 0

0e EPC       80001258

The contents of $pc tell us we are in the general exception handler, the CP0
Cause register tells us that you are touching unmapped memory, and BadVAddr
says you have touched address 0, i.e., it's a NULL pointer dereference.
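
For reference, the exception code can be pulled out of the Cause value with a
shift and a mask. This is a small sketch based on the MIPS32 layout, where
ExcCode occupies bits 6..2. The function and macro names are made up for
illustration.

```c
#include <stdint.h>

#define CAUSE_EXCCODE_SHIFT 2
#define CAUSE_EXCCODE_MASK  0x1Fu
#define EXCCODE_TLBL        2u   /* TLB exception on load or instruction fetch */

/* Extract the ExcCode field (bits 6..2) from a CP0 Cause register value. */
unsigned int cause_exccode(uint32_t cause) {
    return (cause >> CAUSE_EXCCODE_SHIFT) & CAUSE_EXCCODE_MASK;
}
```

With the value from the dump above, cause_exccode(0x00000008) yields 2, which
is the "TLB Exception (Load)" reported by the simulator.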

In this particular build, the EPC points to your function kfree(), and based
on the return address, it is the second call to kfree() from the free_block()
function in stress/test.c.

The actual failing code is apparently inlined into kfree() and seems to be
related to coalescing. Using __attribute__ ((noinline)) to tell GCC to
avoid inlining the coalescing functions (and playing with path length to get
the image size to 10416 bytes) makes the code crash in the coalesce_none()
function. The failing instruction is actually loading a value using the second
argument (nearest_right), which is NULL. That is of course just the place
where the problem manifests, not necessarily the root cause.
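
In case it helps, this is roughly what the attribute looks like on a function.
The struct layout and the function name are hypothetical (we are not quoting
your allocator), and the NULL check only shows where such a guard would go.

```c
#include <stddef.h>

/* Hypothetical block header; your allocator's layout will differ. */
struct block {
    size_t size;
    struct block *next;
};

/*
 * noinline keeps this function as a separate symbol, so EPC and the
 * disassembly point at it instead of at its caller.
 */
__attribute__((noinline))
size_t right_neighbour_size(const struct block *nearest_right) {
    if (nearest_right == NULL) {
        return 0;   /* no right neighbour; a plain load here would fault */
    }
    return nearest_right->size;
}
```

Note that a guard like this only hides the symptom. While debugging, it is
probably more useful to print or panic at that point than to return quietly.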

Please do your own analysis to confirm whatever we found out.

Debugging this will be a bit difficult, because the bug obviously depends on
the size of your kernel image. The size of the image can find its way into
your code only through the _kernel_end symbol, which means that any
calculations that (even transitively) depend on this value are suspect.

A wild guess is that you might have a problem with rounding/alignment in
pointer arithmetic that gets exposed at certain boundaries, which may lead you
to read a pointer from unused memory, perhaps from padding. It might be
interesting to fill memory with address-dependent garbage that can only
produce pointers into the lower 2 GB (so you will get a TLB fault, just like
in the case of a NULL pointer).
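
One possible way to produce such garbage is sketched below. It assumes 32-bit
words. The tag value 0x5A and both function names are arbitrary choices for
illustration, not something your kernel already has.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_TAG 0x5A000000u   /* top bit clear, so the value is below 2 GB */

/* Garbage word derived from the address it is stored at. */
uint32_t poison_value(uintptr_t addr) {
    return POISON_TAG | ((uint32_t)addr & 0x00FFFFFFu);
}

/* Fill a word-aligned region with address-dependent garbage. */
void poison_fill(void *start, size_t bytes) {
    assert(((uintptr_t)start % sizeof(uint32_t)) == 0);
    uint32_t *word = (uint32_t *)start;
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++) {
        word[i] = poison_value((uintptr_t)&word[i]);
    }
}
```

If the kernel later dereferences one of these words as a pointer, BadVAddr
will show 0x5Axxxxxx, and the low 24 bits tell you where in memory the stale
value was read from. In the kernel itself you would replace assert() with
whatever assertion mechanism you have available.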

Also, if you wanted to find a pattern, you could try to inject different
values instead of _kernel_end into the code that depends on it -- this might
be easier than trying to play with the kernel image size by changing paths.
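
A sketch of what we mean, under the assumption that your heap start is
computed by rounding _kernel_end up to some alignment. The 16-byte alignment
and both function names are assumptions; substitute whatever your code
actually does. It is meant to be compiled on the host, not in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define HEAP_ALIGNMENT 16u   /* assumed; use whatever your allocator requires */

/* Round addr up to the nearest multiple of alignment (a power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical stand-in for code that currently reads _kernel_end directly. */
uintptr_t heap_start_for(uintptr_t kernel_end) {
    return align_up(kernel_end, HEAP_ALIGNMENT);
}
```

Once the dependency on _kernel_end is a parameter like this, you can call the
code in a loop over a range of candidate values and check your allocator's
invariants for each of them, without rebuilding the image.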

You may also consider simplifying your code, i.e., going back to a kfree()
without coalescing first, but keep in mind that it may also make the bug
disappear without fixing it.

This is the kind of bug that, if not discovered/fixed now, could make your
code crash randomly later on, so along with all the bad news, there is also a
bit of good news, even though I understand it is difficult to appreciate at
this point.

I guess that's as far as we can go for now.

Best regards,
Lubomir Bulej

On 04/11/2020 10:14, Tomáš Husák wrote:
> Hello,
> 
> we have a problem with passing the tests in gitlab. Before the last commit,
> all tests passed. But after the last commit, one test heap/stress:m1024
> failed. It is weird, because when we execute the tester for this suite on
> the local lab machine, all tests pass. Is there something different between
> testing in gitlab and testing on the lab machine?
> 
> 
> Thank you
> 
> Tomas Husak
> 
>