The fas servers seem to be going into a repeatable OOPS. At present the only remedy I can see is running
/usr/sbin/xm destroy fasXX
/usr/sbin/xm create fasXX
on their master server.
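For reference, the destroy/create recovery cycle above could be scripted roughly like this. It is a dry-run sketch only: the guest names are assumed, and the commands are echoed rather than executed (remove the `echo` and run as root on the dom0 to actually do it).

```shell
# Dry-run sketch of the manual recovery cycle described above.
# Guest names fas01-fas03 are assumptions; "xm create NAME" uses the
# guest config under /etc/xen by convention.
for guest in fas01 fas02 fas03; do
    echo /usr/sbin/xm destroy "$guest"
    echo /usr/sbin/xm create "$guest"
done
```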
---------- Forwarded message ----------
From: Nagios Monitoring User nagios@fedoraproject.org
Date: Fri, Sep 10, 2010 at 19:06
Subject: PROBLEM alert - Host fas03 is DOWN
To: smooge+mobile@gmail.com
Host 'fas03.phx2.fedoraproject.org' is DOWN
Info: CHECK_NRPE: Socket timeout after 20 seconds.
Source: noc01
Time: Sat Sept 11 01:06:51 UTC 2010
On Fri, Sep 10, 2010 at 19:11, Stephen John Smoogen smooge@gmail.com wrote:
For those interested... the oops is usually
Sep 11 01:10:23 fas03 kernel: ------------[ cut here ]------------
Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338 blk_start_queue+0x6c/0x70() (Not tainted)
Sep 11 01:10:23 fas03 kernel: Modules linked in: xen_blkfront dm_mod [last unloaded: scsi_wait_scan]
Sep 11 01:10:23 fas03 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-44.2.el6.i686 #1
Sep 11 01:10:23 fas03 kernel: Call Trace:
Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ? kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ? blkif_interrupt+0x200/0x220 [xen_blkfront]
Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
Sep 11 01:10:23 fas03 kernel: [<c042f4b9>] ? pvclock_clocksource_read+0x169/0x190
Sep 11 01:10:23 fas03 kernel: [<c04b0b81>] ? move_native_irq+0x11/0x50
Sep 11 01:10:23 fas03 kernel: [<c04afe13>] ? handle_level_irq+0x63/0xe0
Sep 11 01:10:23 fas03 kernel: [<c040c042>] ? handle_irq+0x32/0x60
Sep 11 01:10:23 fas03 kernel: [<c066141c>] ? __xen_evtchn_do_upcall+0x12c/0x150
Sep 11 01:10:23 fas03 kernel: [<c0661475>] ? xen_evtchn_do_upcall+0x25/0x40
Sep 11 01:10:23 fas03 kernel: [<c040a57f>] ? xen_do_upcall+0x7/0xc
Sep 11 01:10:23 fas03 kernel: [<c04023a7>] ? hypercall_page+0x3a7/0x1010
Sep 11 01:10:23 fas03 kernel: [<c0406b4f>] ? xen_safe_halt+0xf/0x20
Sep 11 01:10:23 fas03 kernel: [<c040470c>] ? xen_idle+0x1c/0x30
Sep 11 01:10:23 fas03 kernel: [<c0408764>] ? cpu_idle+0x94/0xd0
Sep 11 01:10:23 fas03 kernel: [<c0a5496e>] ? start_kernel+0x38d/0x392
Sep 11 01:10:23 fas03 kernel: [<c0a5441f>] ? unknown_bootoption+0x0/0x190
Sep 11 01:10:23 fas03 kernel: [<c0a57ca4>] ? xen_start_kernel+0x54e/0x554
Sep 11 01:10:23 fas03 kernel: [<c04090ad>] ? do_signal+0x39d/0xa50
Sep 11 01:10:23 fas03 kernel: ---[ end trace ef051dddccbf0b4f ]---
Sep 11 01:10:23 fas03 kernel: ------------[ cut here ]------------
On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ? kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ? blkif_interrupt+0x200/0x220 [xen_blkfront]
Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
The code at block/blk-core.c:338 contains an explicit check that interrupts have been disabled, but this is not true here, because blkif_interrupt is not registered with IRQF_DISABLED set at the time of the setup in bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on when we get to kick_pending_request_queues. Does this always happen?
This perhaps went unnoticed because upstream removed IRQF_DISABLED and now always runs handlers with interrupts disabled in handle_IRQ_event, so Xen won't see this. But on 2.6.32 that change had not yet happened. It's also 2:50am and I might be reading this wrong, but I at least suggest you open a RHEL6 bug and try a more recent kernel build.
Jon.
On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
Ah, of course I shouldn't email before bed. There's an obvious giant spin_lock_irqsave/restore there, but as noted on xen-devel (when they were mulling over moving all of the blkif_interrupt bits into a tasklet just a couple of weeks ago): "It looks like __blk_end_request_all...is returning with interrupts enabled sometimes". I pinged some folks.
Jon.
On Sat, 11 Sep 2010, Jon Masters wrote:
Thanks for looking into this, Jon. We happened to have 3 hosts die of this within about 2 hours last night. Here's the bug report Smooge opened:
https://bugzilla.redhat.com/show_bug.cgi?id=632802
I'll take a look around for a more recent RHEL6 kernel.
-Mike
On Sat, 11 Sep 2010, Jon Masters wrote:
Just so everyone else knows, I've set kernel.panic to 10 on these hosts so at least they'll reboot when they panic. Hopefully we can avoid a few wake-and-reboot issues like we had last night :-/
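For anyone wanting to do the same, a sketch of that setting follows. The sysctl name is real; the sysctl.conf line is the usual way to persist it, and both commands are echoed here rather than executed since they need root on the guest.

```shell
# Sketch: make a panicked guest reboot itself after 10 seconds.
# Echoed as a dry run; drop the outer "echo" to actually apply as root.
echo "sysctl -w kernel.panic=10"                      # takes effect immediately
echo "echo 'kernel.panic = 10' >> /etc/sysctl.conf"   # persists across reboots
```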
-Mike
On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
I pinged some folks about it last night. I would hope there will be a fix for that soon. I suspect it's reproducible on the 70+ kernels, but can you check that for us and update the BZ?
Jon.
On Sat, Sep 11, 2010 at 11:12, Jon Masters jcm@redhat.com wrote:
I have fas03 on a .71 kernel. Since the crashes seem to occur at the same time, I have kept the others on older kernels to see whether the new one fixes the problem or misses it. fas02 will reboot into a .71 if it needs to. I haven't done anything to fas01, to keep it a pristine test ground.
On Sat, 2010-09-11 at 17:09 -0600, Stephen John Smoogen wrote:
Well, it makes sense that they'd fire at the same time. There's clearly some underlying IO path that causes the return with interrupts still on - perhaps an error path, who knows. I will let others poke at it, or find some time to dig in myself, perhaps next week ;)
Jon.
On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
Mike, is there any chance you could boot the -debug kernel on one of these affected systems? Also, can you let us know about the host?
Jon.
On Sun, Sep 12, 2010 at 09:46, Jon Masters jonathan@jonmasters.org wrote:
kernel.panic set to 10 did not reboot the systems. What and where is a debug kernel?
On Sun, 2010-09-12 at 10:12 -0600, Stephen John Smoogen wrote:
I'm not sure where you get them externally. But internally, if you go to brewweb.devel you will see for the kernel package that there are variants like "kernel-debug". Please install that one, since it has lots of extra debugging options turned on. It'll run more slowly, but I doubt it will be noticeable (and the system is already crashing, so...).
Then make sure you have all of the logs going somewhere useful. Do you have any (virtual) serial console setup that you are using to capture the panic output and from which you could capture kernel messages if you set the console loglevel appropriately? Do you have the ability to install another guest on the host system that could be used for debugging this problem? (assuming it is always reproducible)?
Also, please do give me some info on the host system, etc. I am not necessarily going to have time to fix this myself, but I am attempting to ensure that all of the necessary data is at least available tomorrow.
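Putting the pieces Jon asks for together, the setup might look something like this. It is a hedged sketch: yum, grubby, and screen are standard RHEL6 tools, but the exact debug kernel path and guest name are assumptions, and everything is echoed as a dry run.

```shell
# Dry-run sketch of the requested debug setup, for one affected guest.
# <version> is a placeholder for the installed kernel-debug version.
echo "yum install -y kernel-debug"                         # debug variant with extra checks enabled
echo "grubby --set-default /boot/vmlinuz-<version>.debug"  # boot the debug kernel next time
echo "sysctl -w kernel.printk='8 4 1 7'"                   # raise console loglevel to capture everything
echo "screen -L xm console fas01"                          # log the guest serial console to screenlog.0
```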
Jon.
On Sun, Sep 12, 2010 at 10:21, Jon Masters jonathan@jonmasters.org wrote:
Ok, I can log in via serial in a screen session and log output from there. Currently only fas01 is running the old kernel. The two other systems are running new kernels and have not 'rebooted' yet. I will see if I can get a debug kernel onto that one.
infrastructure@lists.fedoraproject.org