On 10.02.2014 11:06, Thomas Gleixner wrote:
> On Mon, 10 Feb 2014, poma wrote:
> > [ 83.558551] [<ffffffff81025b17>] amd_e400_idle+0x87/0x130
> So this seems to happen only on AMD machines which use that e400 idle mode. I have no idea at the moment what's wrong there. I'll find one of those machines and try to reproduce.
> Thanks,
> tglx
Thanks for your response! :) https://bugzilla.redhat.com/show_bug.cgi?id=1031296#c24
poma
On Mon, Feb 10, 2014 at 07:59:39PM +0100, poma wrote:
> On 10.02.2014 11:06, Thomas Gleixner wrote:
> > On Mon, 10 Feb 2014, poma wrote:
> > > [ 83.558551] [<ffffffff81025b17>] amd_e400_idle+0x87/0x130
> > So this seems to happen only on AMD machines which use that e400 idle mode. I have no idea at the moment what's wrong there. I'll find one of those machines and try to reproduce.
I tried to debug that warning as well. Although I found a machine with the proper family and model number, the HW C1E bug does not happen there, so I just hacked the kernel to always use amd_e400_idle (and removed the AMD-specific rdmsr instructions so that it does not crash). That makes the issue 100% reproducible on suspend/resume.
It happens when a cpu becomes idle and calls CLOCK_EVT_NOTIFY_BROADCAST_ENTER, but an interrupt triggers on that cpu before CLOCK_EVT_NOTIFY_BROADCAST_EXIT. The IRQ is handled by the hrtimer code, which wants to switch to hres and calls:
tick_switch_to_oneshot() -> ... -> tick_broadcast_setup_oneshot()
Since we already have the proper handler there, the last procedure clears tick_broadcast_oneshot_mask, but tick_broadcast_pending_mask stays set. The next time amd_e400_idle calls CLOCK_EVT_NOTIFY_BROADCAST_ENTER, the warning triggers.
I came up with the patch below, which also clears the pending mask, but perhaps oneshot_mask should not be cleared on tick_broadcast_setup_oneshot(), or should be cleared only conditionally, or some other solution is needed. Anyway, the patch makes the warning go away on my hacked setup; I was waiting for test results on real C1E hardware.
Thanks Stanislaw
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..98977a5 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -756,6 +756,7 @@ out:
 static void tick_broadcast_clear_oneshot(int cpu)
 {
 	cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
+	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
 }
 
 static void tick_broadcast_init_next_event(struct cpumask *mask,
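The sequence described above can be sketched as a small user-space model. This is not kernel code: the two masks are plain `unsigned long` values, and every function name (`model_enter`, `model_exit`, `model_clear_oneshot`, `run_scenario`) is a hypothetical stand-in. It also assumes that the pending bit gets set while the cpu is in broadcast mode, and that the EXIT path only clears the pending bit when the oneshot bit is still set; both are simplifications made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for tick_broadcast_oneshot_mask / tick_broadcast_pending_mask. */
static unsigned long oneshot_mask;
static unsigned long pending_mask;

static bool mask_test(unsigned long mask, int cpu)
{
	return (mask >> cpu) & 1UL;
}

/* Models BROADCAST_ENTER: returns true if the stale-pending warning would fire. */
static bool model_enter(int cpu)
{
	if (mask_test(oneshot_mask, cpu))
		return false;
	oneshot_mask |= 1UL << cpu;
	return mask_test(pending_mask, cpu);
}

/* Models BROADCAST_EXIT (assumed): cleans up only if the oneshot bit is still set. */
static void model_exit(int cpu)
{
	if (!mask_test(oneshot_mask, cpu))
		return;
	oneshot_mask &= ~(1UL << cpu);
	pending_mask &= ~(1UL << cpu);
}

/* Models tick_broadcast_clear_oneshot(), without or with the proposed extra clear. */
static void model_clear_oneshot(int cpu, bool with_fix)
{
	oneshot_mask &= ~(1UL << cpu);
	if (with_fix)
		pending_mask &= ~(1UL << cpu);
}

/* Replays the described sequence on cpu 0; returns true if the warning would fire. */
bool run_scenario(bool with_fix)
{
	int cpu = 0;
	bool first;

	oneshot_mask = 0;
	pending_mask = 0;

	first = model_enter(cpu);           /* cpu goes idle */
	assert(!first);                     /* clean state: no warning yet */
	(void)first;

	pending_mask |= 1UL << cpu;         /* assumed: cpu marked pending while idle */
	model_clear_oneshot(cpu, with_fix); /* irq -> switch to oneshot -> setup */
	model_exit(cpu);                    /* oneshot bit already gone, nothing cleaned */

	return model_enter(cpu);            /* next idle entry */
}
```

In this model `run_scenario(false)` reports the warning and `run_scenario(true)` does not, which matches the behaviour seen with and without the one-line patch on the hacked setup.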
On Tue, 11 Feb 2014, Stanislaw Gruszka wrote:
> On Mon, Feb 10, 2014 at 07:59:39PM +0100, poma wrote:
> > On 10.02.2014 11:06, Thomas Gleixner wrote:
> > > On Mon, 10 Feb 2014, poma wrote:
> > > > [ 83.558551] [<ffffffff81025b17>] amd_e400_idle+0x87/0x130
> > > So this seems to happen only on AMD machines which use that e400 idle mode. I have no idea at the moment what's wrong there. I'll find one of those machines and try to reproduce.
> I tried to debug that warning as well. Although I found a machine with the proper family and model number, the HW C1E bug does not happen there, so I just hacked the kernel to always use amd_e400_idle (and removed the AMD-specific rdmsr instructions so that it does not crash). That makes the issue 100% reproducible on suspend/resume.
It's also reproducible on cpu online/offline.
> It happens when a cpu becomes idle and calls CLOCK_EVT_NOTIFY_BROADCAST_ENTER, but an interrupt triggers on that cpu before CLOCK_EVT_NOTIFY_BROADCAST_EXIT. The IRQ is handled by the hrtimer code, which wants to switch to hres and calls:
> tick_switch_to_oneshot() -> ... -> tick_broadcast_setup_oneshot()
> Since we already have the proper handler there, the last procedure clears tick_broadcast_oneshot_mask, but tick_broadcast_pending_mask stays set. The next time amd_e400_idle calls CLOCK_EVT_NOTIFY_BROADCAST_ENTER, the warning triggers.
> I came up with the patch below, which also clears the pending mask, but perhaps
Fun. I came up with the exact same solution independent of you and I tested it on real C1E contaminated hardware.
> oneshot_mask should not be cleared on tick_broadcast_setup_oneshot(), or should be cleared only conditionally, or some other solution is
We can do it unconditionally. It creates consistent state in all corner cases.
There are other solutions to the problem, but they need a major rework of the broadcast code. I so wish that this mess had never been necessary at all ...
Thanks,
tglx
> > I came up with the patch below, which also clears the pending mask, but perhaps
> Fun. I came up with the exact same solution independent of you and I tested it on real C1E contaminated hardware.
> > oneshot_mask should not be cleared on tick_broadcast_setup_oneshot(), or should be cleared only conditionally, or some other solution is
> We can do it unconditionally. It creates consistent state in all corner cases.
> There are other solutions to the problem, but they need a major rework of the broadcast code. I so wish that this mess had never been necessary at all ...
Thomas, please post/apply patch, which you think is the most appropriate.
Thanks Stanislaw
kernel@lists.fedoraproject.org