The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
2011/10/27 Don Zickus dzickus@redhat.com
On Thu, Oct 27, 2011 at 04:56:27PM +0200, Antonio Trande wrote:
Ok. I've made a video [1].
The video is too fuzzy and moves around too much for me to see the pieces I need to see. It would be easier to take a picture (2 or 3, probably). Also make sure you cc the Fedora list, as someone there might be able to notice something and help out.
Cheers, Don
2011/10/27 Don Zickus dzickus@redhat.com
On Thu, Oct 27, 2011 at 01:50:17PM +0200, Antonio Trande wrote:
Hello.
After installing the latest kernel (3.1.0-1) on my Fedora 16 x86_64 system, I get a kernel crash during boot. The main problem now is how to recover the boot log or messages log in order to ask for help, because it isn't created.
Can you take a digital picture? Attaching that here will work (don't make the size very large). Or, if your machine has a serial console, you can hook it up to another machine and run minicom to collect the output (after configuring the serial console on the grub command line; I need to dig up a documentation link to assist).
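(For illustration, a minimal sketch of the serial-console setup Don describes; the device ttyS0 and the 115200 baud rate are assumptions that depend on the actual hardware. On the grub kernel command line you would append something like:)

    console=tty0 console=ttyS0,115200n8

(This keeps boot messages on the local VGA console and also mirrors them to the first serial port at 115200 baud, 8N1; minicom on the second machine, set to the same speed, then captures the log.)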
Cheers, Don
--
Antonio Trande
"Fedora Ambassador"
mail: sagitter@fedoraproject.org
Homepage: http://www.fedora-os.org
Sip Address: sip:sagitter AT ekiga.net
Jabber: sagitter AT jabber.org
GPG Key: CFE3479C
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Cheers, Jeff
On Thu, Oct 27, 2011 at 02:29:56PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:29:56PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
Cheers, Jeff
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
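(For reference, an abridged sketch of __schedule_bug from kernel/sched.c circa 3.1, paraphrased from memory rather than quoted exactly; the point is that it complains and dumps state, but never calls BUG():)

    static noinline void __schedule_bug(struct task_struct *prev)
    {
            /* Loud complaint only: a printk() plus diagnostics,
             * not a BUG(), so the kernel keeps running. */
            printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
                   prev->comm, prev->pid, preempt_count());
            print_modules();
            dump_stack();
    }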
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
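(A minimal sketch of the pattern, not the actual block-layer source: spin_lock_irqsave() turns off local interrupts before spinning, so if the holder never releases q->queue_lock, this CPU spins with IRQs off and stops taking timer ticks, which is exactly what the NMI watchdog reports as a hard lockup.)

    unsigned long flags;

    /* IRQs are disabled from here until the unlock; if the lock is
     * never released by its holder, we spin forever with interrupts
     * off and the NMI watchdog eventually fires. */
    spin_lock_irqsave(q->queue_lock, flags);
    /* ... insert the cloned request into the queue ... */
    spin_unlock_irqrestore(q->queue_lock, flags);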
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
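(Illustrative only, paraphrasing Jeff's description rather than quoting the exact 3.1 source: blk_insert_flush() sanity-checks the incoming request, and a cloned flush request arriving with a NULL ->bio trips a check of roughly this shape, halting the machine immediately instead of spinning on a lock:)

    void blk_insert_flush(struct request *rq)
    {
            /* A flush request is expected to carry at most one bio;
             * the pre-fix multipath path could deliver rq->bio == NULL
             * here, tripping the first half of this check. */
            BUG_ON(!rq->bio || rq->bio != rq->biotail);
            /* ... */
    }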
Cheers, Jeff
On Thu, Oct 27, 2011 at 03:20:51PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
Sorry for the confusion here. We saw blk_insert_cloned_request() in the trace and thought it could be related to your fixes; we did not think about the exact symptom of the problem in your case. So you are right: here we are spinning on a spinlock indefinitely, while your patch fixed the BUG_ON(). So maybe it is a different issue.
Thanks Vivek
On Thu, Oct 27, 2011 at 03:09:05PM -0400, Don Zickus wrote:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
Maybe the thread holding the queue lock got scheduled out, hence leading to the deadlock?
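(A sketch of the interleaving Vivek is suggesting; the CPU labels are invented for illustration:)

    /* Hypothetical interleaving (labels invented for illustration):
     *
     *   CPU0: takes q->queue_lock, then schedules while atomic
     *         (the printk fires) and never releases the lock.
     *   CPU1: spin_lock_irqsave(q->queue_lock, flags) spins forever
     *         with IRQs off -> the NMI watchdog declares a hard lockup.
     */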
Thanks Vivek
Vivek Goyal vgoyal@redhat.com writes:
On Thu, Oct 27, 2011 at 03:09:05PM -0400, Don Zickus wrote:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
Maybe the thread holding the queue lock got scheduled out, hence leading to the deadlock?
Assuming all of these messages were from the same boot, the 'scheduling while atomic' message actually came *after* the NMI lockup detection logic fired.
Is there any more information available on this bug? Is it reproducible? What is the storage configuration?
-Jeff