Possible glibc bug manifesting only on SMP ARMv7 systems

Gordan Bobic

29 Nov 2011

7:30 a.m.

Guys,

After chasing my tail for ages thinking I had a hardware issue on an AC100, it looks like the random segfaults and "glibc detected a corrupted doubly linked list" errors might actually be SMP and/or ARMv7 related.

Errors:
- random segfaults
- glibc detected a corrupted doubly linked list

Distro: Fedora 13

Platforms that work flawlessly (24/7 compiling for weeks):
- Marvell Kirkwood (1x SheevaPlug, 1x DreamPlug)

Platforms that cause repeatable segfaults (same rootfs, same operation):
- Tegra2 (tested using Toshiba AC100 and Compulab TrimSlice)
- OMAP 4xxx (tested on a PandaBoard)

I'm going to dig into this deeper (boot the machine with nosmp or tasksetting everything to run on the same core), but in the meantime I would like to ask if there is a bug in any of the following:
- glibc
- gcc
- binutils

that might cause them to misbehave either on:
- ARMv7 (armv5tel packages on armv7l kernel), or
- SMP ARM systems

(or both)

I'm going to compile up a clean kernel (without all the hacks I tried on the AC100 to try to troubleshoot the issue) and try building the packages in a clean F13 mock just to do a definitive confirmation pass, but if anyone is aware of any such issues (e.g. due to locking primitives being different on ARMv7) that have been fixed in glibc/gcc/binutils recently, I would appreciate any info you may have on the subject.

Ubuntu doesn't appear to suffer from this issue, but they use a much newer gcc and a different glibc than what is in F13.

Gordan

Peter Robinson

29 Nov 2011

7:45 a.m.

On Tue, Nov 29, 2011 at 1:30 PM, Gordan Bobic <gordan@bobich.net> wrote:

Ubuntu doesn't appear to suffer from this issue, but they use a much newer gcc and a different glibc than what is in F13.

Have you tried a F-14 rootfs?

Peter

Gordan Bobic

7:48 a.m.

On 11/29/2011 01:45 PM, Peter Robinson wrote:

Have you tried a F-14 rootfs?

Not yet, no. I'll try that when I have a clean recipe for demonstrating the problem on F13. In the meantime any pointers at bugs/patches for anything like this would be welcome.

Gordan

Gordan Bobic

8:01 a.m.

One other thing - one of the manifestations of this bug appears to be random memory corruption (strange, I know - unless I am dealing with two totally unrelated problems). Specifically, I have seen the bug manifest during compile jobs where, for example, linking would segfault, and re-making would segfault again. But doing: echo 3 > /proc/sys/vm/drop_caches would fix the problem.
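One way to probe the drop_caches observation above is to hash a suspect file (for example an object file the linker chokes on) before and after dropping caches. The following is only a minimal sketch under stated assumptions: Python is available in the build root, the function names are hypothetical, and hashing was not something tried in this thread. Dropping caches needs root.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_caches():
    """Rough equivalent of `sync; echo 3 > /proc/sys/vm/drop_caches` (needs root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def changed_after_drop(path, drop=drop_caches):
    """Hash `path`, call `drop`, hash again.

    Returns True if the two hashes differ, which would suggest the copy
    read before the drop did not match what is read afterwards.
    """
    before = sha256_of(path)
    drop()
    after = sha256_of(path)
    return before != after
```

A True result for a file that nothing has rewritten in between would point at the cached copy rather than at the toolchain; a False result does not rule anything out, since the corruption may not be in file-backed pages at all.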

My first suspicion was duff hardware/RAM on my AC100. So I got another one, and it behaves in the exact same way.

Then I thought that maybe they are all pre-overclocked past stable points, so I started hacking at the kernel to drop clock speeds and memory timings (they are bootloader and kernel settable on Tegra2), and none of that made any difference (apart from making the machine slower - the instability remained).

Then I started looking at possible Tegra2 specific bugs, like the TLS register bug. Couldn't get to any conclusive results on that, unfortunately, but nobody running Ubuntu seems to have seen any similar issues on the same hardware.

A couple of days ago somebody on #AC100 offered to re-run my test (building hsqldb src.rpm in mock) on their TrimSlice and on their PandaBoard to try to establish whether the problem might be SMP and/or ARMv7 specific (since I get no stability issues at all on my single-core Kirkwood devices). And sure enough - they saw the same random segfaults arise on BOTH the TrimSlice (Tegra2 A9 SMP) _AND_ the PandaBoard (OMAP 4xxx A9 SMP).

Which implies that the problem is to do with either SMP or running on ARMv7 CPUs, which would indicate an issue with either the glibc or the toolchain, but that is just guessing at the moment.

Any suggestions welcome at this point.

Gordan

Andrew Haley

8:42 a.m.

On 11/29/2011 02:01 PM, Gordan Bobic wrote:

...

One other thing - one of the manifestations of this bug appears to be random memory corruption (strange, I know - unless I am dealing with two totally unrelated problems). Specifically, I have seen the bug manifest during compile jobs where, for example, linking would segfault, and re-making would segfault again. But doing: echo 3 > /proc/sys/vm/drop_caches would fix the problem.

My first suspicion was duff hardware/RAM on my AC100. So I got another one, and it behaves in the exact same way.

The most likely explanation is that you've got a data race somewhere. SMP ARM, unlike x86, has a weakly-ordered memory model. Unless everyone is extremely careful, problems like the one you're describing are very likely.

Andrew.

Gordan Bobic

8:46 a.m.

On 11/29/2011 02:42 PM, Andrew Haley wrote:

The most likely explanation is that you've got a data race somewhere. SMP ARM, unlike x86, has a weakly-ordered memory model. Unless everyone is extremely careful, problems like the one you're describing are very likely.

Indeed, I was thinking about some kind of a concurrency issue, too, but the question is how to fix it. Assuming for a moment that it is not a kernel issue (other people are running the same kernel with Ubuntu without this problem), are we talking about glibc? Or are you saying that _any_ package could be responsible for such a thing?

How can this be fixed?

Gordan

Andrew Haley

9:04 a.m.

On 11/29/2011 02:46 PM, Gordan Bobic wrote:

Indeed, I was thinking about some kind of a concurrency issue, too, but the question is how to fix it. Assuming for a moment that it is not a kernel issue (other people are running the same kernel with Ubuntu without this problem), are we talking about glibc? Or are you saying that _any_ package could be responsible for such a thing?

It's possible that glibc is the problem, but not very likely. User programs can't cause this problem unless they're multi-threaded. As far as I know the linker isn't multi-threaded, though.

Andrew.

Gordan Bobic

10:02 a.m.

On 11/29/2011 03:04 PM, Andrew Haley wrote:

It's possible that glibc is the problem, but not very likely. User programs can't cause this problem unless they're multi-threaded. As far as I know the linker isn't multi-threaded, though.

So could a broken program cause something else to crash, or is the danger limited to the program that is running? The reason I ask is because I have also seen some weirdness where, for example, gcc would get stuck in an infinite loop during compiling, typically bloating until the OOM killer terminates it. It doesn't happen often, but I have seen it more than once.

Gordan

Henrik Nordström

30 Nov 2011

2:43 p.m.

On Tue, 2011-11-29 at 16:02 +0000, Gordan Bobic wrote:

...

So could a broken program cause something else to crash, or is the danger limited to the program that is running?

For a broken program to cause something else to crash, one of the following must hold:

a) the output of the broken program is used as input to the other program and contains errors that cause it to crash;
b) the broken program causes an acute resource shortage (e.g. running out of virtual memory), causing other programs running at the same time to crash;
c) it is actually the kernel that is broken.

...

The reason I ask is because I have also seen some weirdness where, for example, gcc would get stuck in an infinite loop during compiling, typically bloating until the OOM killer terminates it. It doesn't happen often, but I have seen it more than once.

We have seen a couple of those on ARMv7hl F15, related to optimization or debug information exploding in size. But those have always been 100% reproducible, and most often repeatable on i386 as well; they are just not as noticeable there because of the larger memory and swap, and because the Fedora build farm runs x86_64 kernels, which give applications almost the full 4GB virtual address space (3GB max on 32-bit kernels, as on ARM).

Regards Henrik

Gordan Bobic

5:18 p.m.

On 11/30/2011 08:43 PM, Henrik Nordström wrote:

We have seen a couple of those on ARMv7hl F15, related to optimization or debug information exploding in size. But those have always been 100% reproducible, and most often repeatable on i386 as well; they are just not as noticeable there because of the larger memory and swap, and because the Fedora build farm runs x86_64 kernels, which give applications almost the full 4GB virtual address space (3GB max on 32-bit kernels, as on ARM).

That's good to know.

I have been digging a little deeper and from what I can tell, it would appear that I have an issue related to gcj on an SMP machine. It segfaults. For example when rebuilding the gcc package, sometimes this happens:

/bin/sh ./libtool --tag=GCJ --mode=compile /builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/./gcc/gcj -B/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/armv5tel-redhat-linux-gnueabi/libjava/ -B/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/./gcc/ -B/usr/armv5tel-redhat-linux-gnueabi/bin/ -B/usr/armv5tel-redhat-linux-gnueabi/lib/ -isystem /usr/armv5tel-redhat-linux-gnueabi/include -isystem /usr/armv5tel-redhat-linux-gnueabi/sys-include -fclasspath= -fbootclasspath=../../../libjava/classpath/lib --encoding=UTF-8 -Wno-deprecated -fbootstrap-classes -O2 -g -pipe -Wall -fexceptions -fstack-protector --param=ssp-buffer-size=4 -march=armv5te -c -o gnu/java/security/hash.lo -fsource-filename=/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/armv5tel-redhat-linux-gnueabi/libjava/classpath/lib/classes -MT gnu/java/security/hash.lo -MD -MP -MF gnu/java/security/hash.deps @gnu/java/security/hash.list
libtool: compile: /builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/./gcc/gcj -B/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/armv5tel-redhat-linux-gnueabi/libjava/ -B/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/./gcc/ -B/usr/armv5tel-redhat-linux-gnueabi/bin/ -B/usr/armv5tel-redhat-linux-gnueabi/lib/ -isystem /usr/armv5tel-redhat-linux-gnueabi/include -isystem /usr/armv5tel-redhat-linux-gnueabi/sys-include -fclasspath= -fbootclasspath=../../../libjava/classpath/lib --encoding=UTF-8 -Wno-deprecated -fbootstrap-classes -O2 -g -pipe -Wall -fexceptions -fstack-protector --param=ssp-buffer-size=4 -march=armv5te -c -fsource-filename=/builddir/build/BUILD/gcc-4.4.5-20110214/obj-armv5tel-redhat-linux-gnueabi/armv5tel-redhat-linux-gnueabi/libjava/classpath/lib/classes -MT gnu/java/security/hash.lo -MD -MP -MF gnu/java/security/hash.deps @gnu/java/security/hash.list -fPIC -o gnu/java/security/.libs/hash.o
/builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Tiger.java: In method 'gnu.java.security.hash.Tiger.transform(byte[],int)':
In file included from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Sha512.java:277,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Sha384.java:275,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Sha256.java:248,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Sha160.java:239,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/RipeMD160.java:289,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/RipeMD128.java:255,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/MD5.java:369,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/MD4.java:336,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/MD2.java:255,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/IMessageDigest.java:0,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Haval.java:805,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/HashFactory.java:183,
                 from /builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/BaseHash.java:158,
                 from <built-in>:17:
/builddir/build/BUILD/gcc-4.4.5-20110214/libjava/classpath/gnu/java/security/hash/Tiger.java:863: internal compiler error: Segmentation fault

But rerunning the libtool command above a second time succeeds.

Works fine on a non-SMP machine, though.

Because I'm building things in mock, it's a tad hard to debug as is. I'll try to get a core dump and see whether that provides any useful insights. I'll also try the same thing after booting the kernel with nosmp just to cross-check, since that cured an identical problem when building hsqldb, for example (at the expense of running on only one core).
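Since the failure is intermittent, the nosmp/taskset cross-check described above could be made more quantitative by rerunning the failing command many times, with and without CPU pinning, and comparing failure counts. This is only a rough sketch: the helper name `count_failures` is hypothetical, it assumes a Linux host with Python available, and it counts any non-zero exit (including death by signal) as a failure rather than trying to recognise the internal compiler error specifically.

```python
import os
import subprocess


def count_failures(cmd, runs=20, cpus=None):
    """Run `cmd` `runs` times and count the runs that did not exit cleanly.

    If `cpus` is given (e.g. {0}), each child is pinned to those CPUs
    before exec, roughly what `taskset -c 0 cmd` would do.
    """
    def pin():
        # Runs in the child between fork and exec; Linux-only call.
        os.sched_setaffinity(0, cpus)

    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            preexec_fn=pin if cpus else None,
        )
        # Non-zero covers both error exits and termination by a signal
        # (which subprocess reports as a negative return code on POSIX).
        if result.returncode != 0:
            failures += 1
    return failures
```

Comparing `count_failures(cmd)` against `count_failures(cmd, cpus={0})` for the same command would show whether confining the build to one core makes the failures go away, without needing a reboot with nosmp. Children inherit the affinity mask, so the compiler processes spawned by the command stay on the same core.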

Meanwhile - does anybody know of a bug like this that might have been fixed?

Gordan

arm@lists.fedoraproject.org

9 comments

4 participants

tags (0)

participants (4)

Andrew Haley
Gordan Bobic
Henrik Nordström
Peter Robinson