The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
2011/10/27 Don Zickus dzickus@redhat.com
On Thu, Oct 27, 2011 at 04:56:27PM +0200, Antonio Trande wrote:
Ok. I've made a video [1].
The video is too fuzzy and moves around too much for me to see the pieces I need to see. It would be easier to take a picture (2 or 3, probably). Also make sure you cc the Fedora list, as someone there might be able to notice something and help out.
Cheers, Don
2011/10/27 Don Zickus dzickus@redhat.com
On Thu, Oct 27, 2011 at 01:50:17PM +0200, Antonio Trande wrote:
Hello.
After installing the latest kernel (3.1.0-1) on my Fedora 16 x86_64 system, I get a kernel crash during boot. The main problem now is how to recover the boot log or messages log in order to ask for help, because it isn't created.
Can you take a digital picture? Attaching that here will work (don't make the size very large). Or, if your machine has a serial console, you can hook it up to another machine and run minicom to collect the output (after configuring the serial console on the grub command line; I need to dig up a documentation link to assist).
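(For illustration, a minimal sketch of the serial-console setup Don describes; the device ttyS0 and the 115200 baud rate are assumptions that depend on the actual hardware. On the grub kernel command line you would append something like:)

    console=tty0 console=ttyS0,115200n8

(This keeps boot messages on the local VGA console and also mirrors them to the first serial port at 115200 baud, 8N1; minicom on the second machine, set to the same speed, then captures the log.)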
Cheers, Don
--
Antonio Trande
"Fedora Ambassador"
mail: sagitter@fedoraproject.org
Homepage: http://www.fedora-os.org
Sip Address: sip:sagitter AT ekiga.net
Jabber: sagitter AT jabber.org
GPG Key: CFE3479C
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Cheers, Jeff
On Thu, Oct 27, 2011 at 02:29:56PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:29:56PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 06:59:12PM +0200, Antonio Trande wrote:
The digital pictures are here: http://sagitter.fedorapeople.org/kernel-boot.tar.gz
According to our DM guys, this fix just missed 3.1:
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943... http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7...
I cc'd Jeff Moyer; he knows about it. The DM folks assume that you are using multipath (as that is the only way you can hit this bug, from their point of view).
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
Cheers, Jeff
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
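(For reference, an abridged sketch of __schedule_bug from kernel/sched.c circa 3.1, paraphrased from memory rather than quoted exactly; the point is that it complains and dumps state, but never calls BUG():)

    static noinline void __schedule_bug(struct task_struct *prev)
    {
            /* Loud complaint only: a printk() plus diagnostics,
             * not a BUG(), so the kernel keeps running. */
            printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
                   prev->comm, prev->pid, preempt_count());
            print_modules();
            dump_stack();
    }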
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
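(A minimal sketch of the pattern, not the actual block-layer source: spin_lock_irqsave() turns off local interrupts before spinning, so if the holder never releases q->queue_lock, this CPU spins with IRQs off and stops taking timer ticks, which is exactly what the NMI watchdog reports as a hard lockup.)

    unsigned long flags;

    /* IRQs are disabled from here until the unlock; if the lock is
     * never released by its holder, we spin forever with interrupts
     * off and the NMI watchdog eventually fires. */
    spin_lock_irqsave(q->queue_lock, flags);
    /* ... insert the cloned request into the queue ... */
    spin_unlock_irqrestore(q->queue_lock, flags);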
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Cheers, Don
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
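(Illustrative only, paraphrasing Jeff's description rather than quoting the exact 3.1 source: blk_insert_flush() sanity-checks the incoming request, and a cloned flush request arriving with a NULL ->bio trips a check of roughly this shape, halting the machine immediately instead of spinning on a lock:)

    void blk_insert_flush(struct request *rq)
    {
            /* A flush request is expected to carry at most one bio;
             * the pre-fix multipath path could deliver rq->bio == NULL
             * here, tripping the first half of this check. */
            BUG_ON(!rq->bio || rq->bio != rq->biotail);
            /* ... */
    }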
Cheers, Jeff
On Thu, Oct 27, 2011 at 03:20:51PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup occurred because blk_insert_cloned_request() disabled interrupts with spin_lock_irqsave() and then spun forever waiting on q->queue_lock (IMG_0350.JPG).
Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
Sorry for the confusion here. We saw blk_insert_cloned_request() in the trace and thought it could be related to your fixes; we did not think about the exact symptom of the problem in your case. So you are right: here we are spinning on a spinlock indefinitely, while your patch fixed the BUG_ON(). So maybe it is a different issue.
Thanks Vivek
On Thu, Oct 27, 2011 at 03:09:05PM -0400, Don Zickus wrote:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
Maybe the thread holding the queue lock got scheduled out, hence leading to the deadlock?
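(A sketch of the interleaving Vivek is suggesting; the CPU labels are invented for illustration:)

    /* Hypothetical interleaving (labels invented for illustration):
     *
     *   CPU0: takes q->queue_lock, then schedules while atomic
     *         (the printk fires) and never releases the lock.
     *   CPU1: spin_lock_irqsave(q->queue_lock, flags) spins forever
     *         with IRQs off -> the NMI watchdog declares a hard lockup.
     */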
Thanks Vivek
Vivek Goyal vgoyal@redhat.com writes:
On Thu, Oct 27, 2011 at 03:09:05PM -0400, Don Zickus wrote:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
This doesn't look like the same problem. Here we've got BUG: scheduling while atomic. If it was the bug fixed by the above commits, then you would hit a BUG_ON. I would start looking at the btrfs bits to see if they're holding any locks in this code path.
Ignore that one and move to IMG_0350.JPG. 'Scheduling while atomic' is just noise. Besides, Mike and Vivek told me to blame you for not pushing Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'scheduling while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (Hint: read kernel/sched.c::__schedule_bug.)
Maybe the thread holding the queue lock got scheduled out, hence leading to the deadlock?
Assuming all of these messages were from the same boot, the 'scheduling while atomic' message actually came *after* the NMI lockup detection logic fired.
Is there any more information available on this bug? Is it reproducible? What is the storage configuration?
-Jeff