> Do you have multipath configured on your box?

If I have understood the 'multipath concept' correctly, yes. fdisk output: http://www.fpaste.org/KXvm/

> How often can you reproduce this problem?

Only with Kernel 3.1. If fsck is enabled on the / partition (btrfs filesystem), also with Kernel 3.0.
2011/10/27 Vivek Goyal vgoyal@redhat.com
On Thu, Oct 27, 2011 at 09:31:13PM +0200, Antonio Trande wrote:
Should I be the "victim"? :) If you need tests, I'm available.
Do you have multipath configured on your box? How often can you reproduce this problem? Can you reproduce the problem with a single CPU in the system?
Thanks Vivek
2011/10/27 Vivek Goyal vgoyal@redhat.com
On Thu, Oct 27, 2011 at 03:20:51PM -0400, Jeff Moyer wrote:
Don Zickus dzickus@redhat.com writes:
On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
>> This doesn't look like the same problem. Here we've got BUG: scheduling
>> while atomic. If it was the bug fixed by the above commits, then you
>> would hit a BUG_ON. I would start looking at the btrfs bits to see if
>> they're holding any locks in this code path.
>
> Ignore that one and move to IMG_0350.IMG. 'scheduling while atomic' is
> just noise. Besides Mike and Vivek told me to blame you for not pushing
> Jens harder on these fixes. :-)))))
I'm looking at 0355, which shows the very top of the trace, and that says BUG: scheduling while atomic. So the problem reported here *is* different from the one fixed by the above two commits. In fact, I don't see evidence of the multipath + flush issue in any of these pictures.
You have to ignore the 'schedule while atomic' thing; it is just a printk("BUG: scheduling while atomic"), it is _not_ a BUG(). :-) (hint: read kernel/sched.c::__schedule_bug)
I see those messages all the time; it really should be a WARN and not a misleading BUG, but whatever.
His machine died because the NMI watchdog detected a lockup. The lockup was because in blk_insert_cloned_request(), spin_lock_irqsave disabled interrupts and spun forever waiting on the q->queue_lock (IMG_0350.JPG). Mike and Vivek both said that is what you fixed for 3.2. They also said the only caller of blk_insert_cloned_request() is multipath, hence that argument. I'll cc them. Or maybe I can have them walk over to your cube. :-)
Well then they know more than I do. The bug I fixed would not result in infinite spinning on the queue lock. It resulted in a BUG_ON in blk_insert_flush, since req->bio was NULL. So again, I really don't see how this is related. We could put this all to rest by asking the victim to try out those two patches.
Sorry for the confusion here. We saw the blk_insert_cloned_request() in the trace and thought it could be related to your fixes. We did not think about the exact symptom of the problem in your case. So you are right: here we are spinning on a spinlock infinitely, and your patch fixed the BUG_ON(). So maybe it is a different issue.
Thanks Vivek
--
Antonio Trande, "Fedora Ambassador"
mail: sagitter@fedoraproject.org
Homepage: http://www.fedora-os.org
Sip Address: sip:sagitter AT ekiga.net
Jabber: sagitter AT jabber.org
GPG Key: CFE3479C
I don't know if this is useful, but during boot with kernel 3.0 the following appears:
$ dmesg | grep multipath
[ 4.113786] device-mapper: multipath: version 1.3.0 loaded
[ 4.164462] device-mapper: multipath round-robin: version 1.0.0 loaded
[ 35.443230] multipathd[1184]: /lib/udev/scsi_id exitted with 1
[ 35.443682] multipathd[1184]: /lib/udev/scsi_id exitted with 1
Must I consider this problem a kernel 3.1 bug? I don't know where this multipath configuration comes from; I have always done plain Fedora installations.
Thanks.
On Fri, Oct 28 2011 at 10:30am -0400, Antonio Trande anto.trande@gmail.com wrote:
You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
> You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
It's only a doubt of mine. Just once I tried enabling fsck on btrfs in fstab with Kernel 3.0, and I seemed to get the same problem (but I'm not certain).
> That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
# multipath -ll
mpatha (350014ee25794e366) dm-0 ATA,WDC WD5000BEVT-2
size=466G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 0:0:0:0 sda 8:0 active ready running
On Fri, Oct 28 2011 at 11:11am -0400, Antonio Trande anto.trande@gmail.com wrote:
> You said that you saw the problem with Linux 3.0 too, though? But only if fsck is enabled... yet btrfs doesn't yet have a publicly available fsck, so you need to clarify your earlier 3.0 comment.
> It's only a doubt of mine. Just once I tried enabling fsck on btrfs in fstab with Kernel 3.0, and I seemed to get the same problem (but I'm not certain).
Again, there is no fsck for btrfs yet, so this doesn't make sense.
> That aside, I'd imagine that anaconda unnecessarily introduced a multipath layer for your storage when you really don't have multiple paths. What does 'multipath -ll' show?
> # multipath -ll
> mpatha (350014ee25794e366) dm-0 ATA,WDC WD5000BEVT-2
> size=466G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   `- 0:0:0:0 sda 8:0 active ready running
Yeah, anaconda should _not_ have introduced multipath for your setup. It is a layer of complexity that you are not benefiting from (worse: it is somehow causing instability with your btrfs config).
It is possible to rebuild your initramfs (using dracut) so it does _not_ have multipath enabled.
I think the easiest would be to simply uninstall the 'device-mapper-multipath' package and make sure /etc/multipath/ and /etc/multipath.conf no longer exist.
So when you re-create the initramfs with dracut it won't be able to copy any multipath enabling bits into the initramfs.
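A sketch of the cleanup described above, assuming Fedora's usual package name and initramfs path (run as root; the final `--omit` variant is an alternative not mentioned in the thread, which skips dracut's multipath module without uninstalling the package):

```shell
# Remove the multipath userspace bits so dracut has nothing to pull in
yum remove device-mapper-multipath
rm -rf /etc/multipath /etc/multipath.conf

# Rebuild the initramfs for the running kernel; with the package and
# config gone, dracut can no longer include the multipath module
dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

# Alternatively, keep the package but tell dracut to omit its module
dracut --force --omit multipath /boot/initramfs-$(uname -r).img $(uname -r)
```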
Mike
> Again, there is no fsck for btrfs yet, so this doesn't make sense.
You had asked me for clarification; I've only just realized that. Sorry. :)
kernel@lists.fedoraproject.org