> Do you have multipath configured on your box?

If I have understood the 'multipath concept' correctly, yes. fdisk output: http://www.fpaste.org/KXvm/

> How often can you reproduce this problem?

Only with Kernel 3.1. If fsck is enabled on the / partition (btrfs filesystem), also with Kernel 3.0.
2011/10/27 Vivek Goyal vgoyal@redhat.com
On Thu, Oct 27, 2011 at 09:31:13PM +0200, Antonio Trande wrote:
Should I be the "victim"? :) If you need tests, I'm available.
Do you have multipath configured on your box? How often can you reproduce this problem? Can you reproduce the problem with a single CPU in the system?
Thanks Vivek
2011/10/27 Vivek Goyal vgoyal@redhat.com
On Thu, Oct 27, 2011 at 03:20:51PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
>> This doesn't look like the same problem. Here we've got BUG: scheduling
>> while atomic. If it was the bug fixed by the above commits, then you
>> would hit a BUG_ON. I would start looking at the btrfs bits to see if
>> they're holding any locks in this code path.
>
> Ignore that one and move to IMG_0350.IMG. 'scheduling while atomic' is
> just noise. Besides Mike and Vivek told me to blame you for not pushing
> Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'schedule while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (hint: read kernel/sched.c::__schedule_bug)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup was because in blk_insert_cloned_request(), spin_lock_irqsave disabled interrupts and spun forever waiting on the q->queue_lock (IMG_0350.JPG). Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
Sorry for the confusion here. We saw the blk_insert_cloned_request() in the trace and thought it could be related to your fixes. We did not think about the exact symptom of the problem in your case. So you are right: here we are spinning on a spinlock infinitely, and your patch fixed the BUG_ON(). So maybe it is a different issue.
Thanks Vivek
--
Antonio Trande, "Fedora Ambassador"
mail: sagitter@fedoraproject.org
Homepage: http://www.fedora-os.org
Sip Address: sip:sagitter AT ekiga.net
Jabber: sagitter AT jabber.org
GPG Key: CFE3479C
I don't know if this is useful, but during boot with kernel 3.0 the following appears:
$ dmesg | grep multipath
[ 4.113786] device-mapper: multipath: version 1.3.0 loaded
[ 4.164462] device-mapper: multipath round-robin: version 1.0.0 loaded
[ 35.443230] multipathd[1184]: /lib/udev/scsi_id exitted with 1
[ 35.443682] multipathd[1184]: /lib/udev/scsi_id exitted with 1
Must I consider this problem a kernel 3.1 bug? I don't know where this multipath configuration comes from; I have always done plain Fedora installations.
Thanks.
On Fri, Oct 28 2011 at 10:30am -0400, Antonio Trande anto.trande@gmail.com wrote:
You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
> You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
It's only a doubt of mine. Just once I tried enabling fsck on btrfs in fstab with Kernel 3.0, and I seemed to get the same problem (but I'm not certain).
> That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
# multipath -ll
mpatha (350014ee25794e366) dm-0 ATA,WDC WD5000BEVT-2
size=466G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 0:0:0:0 sda 8:0 active ready running
On Fri, Oct 28 2011 at 11:11am -0400, Antonio Trande anto.trande@gmail.com wrote:
> You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
> It's only a doubt of mine. Just once I tried enabling fsck on btrfs in fstab with Kernel 3.0, and I seemed to get the same problem (but I'm not certain).
Again, there is no fsck for btrfs yet, so this doesn't make sense.
> That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
> # multipath -ll
> mpatha (350014ee25794e366) dm-0 ATA,WDC WD5000BEVT-2
> size=466G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   `- 0:0:0:0 sda 8:0 active ready running
Yeah, anaconda should _not_ have introduced multipath for your setup. It is a layer of complexity that you are not benefiting from (worse: it is somehow causing instability with your btrfs config).
It is possible to rebuild your initramfs (using dracut) so it does _not_ have multipath enabled.
I think the easiest would be to simply uninstall the 'device-mapper-multipath' package and make sure /etc/multipath/ and /etc/multipath.conf no longer exist.
So when you re-create the initramfs with dracut it won't be able to copy any multipath enabling bits into the initramfs.
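A sketch of the cleanup described above, assuming Fedora's usual package name and initramfs path (run as root; the final `--omit` variant is an alternative not mentioned in the thread, which skips dracut's multipath module without uninstalling the package):

```shell
# Remove the multipath userspace bits so dracut has nothing to pull in
yum remove device-mapper-multipath
rm -rf /etc/multipath /etc/multipath.conf

# Rebuild the initramfs for the running kernel; with the package and
# config gone, dracut can no longer include the multipath module
dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

# Alternatively, keep the package but tell dracut to omit its module
dracut --force --omit multipath /boot/initramfs-$(uname -r).img $(uname -r)
```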
Mike
> Again, there is no fsck for btrfs yet, so this doesn't make sense.
You had asked me for clarification; I've only just realized that. Sorry. :)
kernel@lists.fedoraproject.org