I reproduced what jwb reported on the powerstation. Mine is with F10 + updates userland, only the kernel seems to matter. The test case is:
# modprobe iscsi_tcp
Illegal Instruction
#
On -16[278], same oops that jwb saw, wrong text appearing at a page boundary.
This kernel: http://kojipkgs.fedoraproject.org/scratch/roland/task_1396640/kernel-vanilla...
does not exhibit the problem. That should be all the same buildroot stuff, and 2.6.29.4 with no extra patches.
OTOH, this kernel: http://kojipkgs.fedoraproject.org/scratch/roland/task_1396192/kernel-2.6.29....
also does not exhibit the problem. That is normal -167 with all the same patches, but built in dist-f10-updates-candidate buildroots.
But contrary to jwb's reports: On my powerstation 2.6.29.3-159.fc11.ppc64 fails to boot:
Loading ipr moduipr 0001:01:01.0: IOA initialized.
le
scsi0 : IBM 572C Storage Adapter
scsi 0:0:0:0: Direct-Access IBM-ESXS ST373455SS BA23 PQ: 0 ANSI: 5
Oops: Exception in kernel mode, sig: 4 [#1]
SMP NR_CPUS=128 NUMA Maple
Modules linked in: ipr(+)
NIP: c000000000400000 LR: c0000000003ffff8 CTR: c00000000044304c
REGS: c0000000781a7620 TRAP: 0700 Not tainted (2.6.29.3-159.fc11.ppc64)
MSR: 9000000000089032 <EE,ME,IR,DR> CR: 24000024 XER: 000fffff
TASK = c000000078193500[70] 'scsi_scan_0' THREAD: c0000000781a4000 CPU: 2
GPR00: c0000000003ffff8 c0000000781a78a0 c000000000e6c210 c000000000db0390
GPR04: 0000000000000058 c00000007be6e000 000000000000000a ffffffffffffffff
GPR08: 0000000000000000 c0000000781a4000 c00000007be6e618 0000000000000000
GPR12: 0000000044000028 c000000000ea2800 000000000021a344 000000000021a394
GPR16: 000000000021a368 0000000000000000 c00000007be6e000 c0000000781a7b50
GPR20: c000000078090800 0000000000000000 0000000000000000 c00000007be6e060
GPR24: c000000078090828 0000000000000000 0000000000000000 fffffffffffffffa
GPR28: c000000078400120 0000000000000000 c000000000e0d710 c0000000781a7810
NIP [c000000000400000] .transport_destroy_device+0x3c/0x50
LR [c0000000003ffff8] .transport_destroy_device+0x34/0x50
Call Trace:
[c0000000781a78a0] [c000000000446c9c] .scsi_alloc_sdev+0x220/0x284 (unreliable)
[c0000000781a7950] [c000000000446f8c] .scsi_probe_and_add_lun+0x168/0xda8
[c0000000781a7ad0] [c000000000447fb0] .__scsi_scan_target+0x104/0x730
[c0000000781a7c20] [c000000000448654] .scsi_scan_channel+0x78/0xe8
[c0000000781a7ce0] [c0000000004489d4] .scsi_scan_host_selected+0x11c/0x1a8
[c0000000781a7da0] [c000000000448b48] .do_scsi_scan_host+0xe8/0x10c
[c0000000781a7e40] [c000000000448bb0] .do_scan_async+0x44/0x220
[c0000000781a7ef0] [c0000000000c57f8] .kthread+0x90/0xe4
[c0000000781a7f90] [c00000000002f730] .kernel_thread+0x54/0x70
Instruction dump:
fbe1fff8 f821ff71 7c3f0b78 ebc2d0f0 f87f0070 60000000 60000000 e87f0070
e89e8000 4bfff8cd 60000000 383f0090 <00001010> 00000008 00001013 0000000f
---[ end trace 998c8eb2fd0a3b41 ]---
usb 2-1.4: new full speed USB device using ohci_hcd and address 3
usb 2-1.4: New USB device found, idVendor=05ac, idProduct=1003
It prints some more usb probe msgs, but nothing else. I can then reboot it with Ctl-Alt-Del on the USB keyboard.
This is obviously a variant of the same problem. It's losing on clobbered instructions at a page boundary.
Same for -158. Same for -157. Same for -155. Same for -154.
Man but these bastards boot slow.
Same for -152.
-142 does not do that, nor exhibit the problem in "modprobe iscsi_tcp".
buildroot diffs -142 vs -152:
@@ -23,8 +23,8 @@
 device-mapper-1.02.31-4.fc11.ppc64
 device-mapper-libs-1.02.31-4.fc11.ppc64
 diffutils-2.8.1-23.fc11.ppc64
-e2fsprogs-1.41.4-8.fc11.ppc64
-e2fsprogs-libs-1.41.4-8.fc11.ppc64
+e2fsprogs-1.41.4-9.fc11.ppc64
+e2fsprogs-libs-1.41.4-9.fc11.ppc64
 elfutils-0.140-2.fc11.ppc64
 elfutils-libelf-0.140-2.fc11.ppc64
 elfutils-libs-0.140-2.fc11.ppc64
@@ -89,7 +89,7 @@
 nss-3.12.3-4.fc11.ppc64
 nss-softokn-freebl-3.12.3-4.fc11.ppc64
 openldap-2.4.15-3.fc11.ppc64
-openssl-0.9.8k-1.fc11.ppc64
+openssl-0.9.8k-4.fc11.ppc64
 pam-1.0.91-6.fc11.ppc64
 patch-2.5.4-38.fc11.ppc64
 pcre-7.8-2.fc11.ppc64
@@ -102,7 +102,7 @@
 pkgconfig-0.23-8.fc11.ppc64
 policycoreutils-2.0.62-12.2.fc11.ppc64
 popt-1.13-5.fc11.ppc64
-ppl-0.10.1-1.fc11.ppc64
+ppl-0.10.2-2.fc11.ppc64
 procps-3.2.7-27.fc11.ppc64
 psmisc-22.6-9.fc11.ppc64
 readline-5.2-14.fc11.ppc64
ppl is a library used by gcc. I eyeballed 0.10.1->0.10.2 changes and I doubt it's involved (AFAIK no actual code changes there).
http://koji.fedoraproject.org/koji/taskinfo?taskID=1396771 is a scratch build of -142 in the current buildroots. I'll go to sleep before it finishes, try it tomorrow.
I also tried:
koji build --nowait --scratch --arch-override=ppc64 --repo-id=72872 dist-f11 'cvs://cvs.fedoraproject.org/cvs/pkgs?rpms/kernel/F-11#kernel-2_6_29_3-152_fc11'
but:
BuildError: Bad repo: 72872 (DELETED)
That would have been -152 but built in the buildroot -142 was built in. I guess we'd need a temporary tag or something to recover the right repo to do that build in koji.
lbr () {
  koji rpminfo "$1" | awk '$1 == "Buildroot:" { print $2 }' | xargs -n1 koji list-buildroot
}
diff -u <(lbr kernel-2.6.29.3-142.fc11.ppc64) <(koji list-buildroot 470988)
for the buildroot diffs from -142 to the scratch -142 (current buildroot).
Now you know what I know and we still know nothing.
Oh, and note the two variant crashes in different kernels are in different routines in different builds, but always at PC 0xc000000000400000, and always clobbered the next few words with: 00001010 00000008 00001013 0000000f
The magic PAGE_OFFSET+4MB effect. So, youse gots to wonder, and...
On 2.6.29.3-142.fc11.ppc64, which has "no problem", I built the appended module. It printed this:
Instruction dump:
e8090000 f8410028 7f83e378 e9690010 7fa5eb78 7c0903a6 e8490008 4e800421
<00001010> 00000008 00001013 0000000f 7961626f 6f740000 00101600 00000c00
                                      ^^^^^^^^ ^^^^ <-- spells "yaboot"
00000400 00101100 00000800 7fa3eb78 4bfff24d 60000000 38600000 383f00b0
                           ^^^^^^^^ <-- goes to correct text again from here
The magic 44 bytes of bogon at PAGE_OFFSET+4MB effect. We have no idea how long we have been screwed.
I updated to yaboot-1.3.14-12.fc11.ppc (was f10), ran ybin, no help.
So long and thanks for all the geese, Roland
======
#include <linux/module.h>
#include <asm/ptrace.h>
#include <asm/uaccess.h>

MODULE_DESCRIPTION("fmh");
MODULE_LICENSE("GPL");

int ninsn = 24;
module_param(ninsn, int, 0);

static void __exit exit_fmh(void)
{
}

static int __init init_fmh(void)
{
	int i;
	unsigned long pc = 0xc000000000400000 - 8*4;

	printk(KERN_ERR "Instruction dump:");

	for (i = 0; i < ninsn; i++) {
		int instr;

		if (!(i % 8))
			printk("\n");

		/* We use __get_user here *only* to avoid an OOPS on a
		 * bad address because the pc *should* only be a
		 * kernel address.
		 */
		if ( __get_user(instr, (unsigned int __user *)pc)) {
			printk("XXXXXXXX ");
		} else {
			if (0xc000000000400000 == pc)
				printk("<%08x> ", instr);
			else
				printk("%08x ", instr);
		}

		pc += sizeof(int);
	}

	printk("\n");

	return -EAGAIN;
}

module_init(init_fmh);
module_exit(exit_fmh);
On Sat, Jun 06, 2009 at 05:27:49AM -0700, Roland McGrath wrote:
> I reproduced what jwb reported on the powerstation. Mine is with F10 + updates userland, only the kernel seems to matter. The test case is:
> # modprobe iscsi_tcp
> Illegal Instruction
> #
> On -16[278], same oops that jwb saw, wrong text appearing at a page boundary.
> This kernel: http://kojipkgs.fedoraproject.org/scratch/roland/task_1396640/kernel-vanilla...
> does not exhibit the problem. That should be all the same buildroot stuff, and 2.6.29.4 with no extra patches.
> OTOH, this kernel: http://kojipkgs.fedoraproject.org/scratch/roland/task_1396192/kernel-2.6.29....
> also does not exhibit the problem. That is normal -167 with all the same patches, but built in dist-f10-updates-candidate buildroots.
> But contrary to jwb's reports: On my powerstation 2.6.29.3-159.fc11.ppc64 fails to boot:
That's not contrary. We were testing on different machines. I was testing on an Apple PowerMac7,2 (dual ppc970 G5) which uses sata_svw for storage, not ipr.
> This is obviously a variant of the same problem.
Right.
> It's losing on clobbered instructions at a page boundary.
Yes, seems so.
> Man but these bastards boot slow.
I've noticed that about the powerstation, yes. The G5 boots surprisingly quick with F11. Go figure.
> Oh, and note the two variant crashes in different kernels are in different routines in different builds, but always at PC 0xc000000000400000, and always clobbered the next few words with: 00001010 00000008 00001013 0000000f
> The magic PAGE_OFFSET+4MB effect. So, youse gots to wonder, and...
> On 2.6.29.3-142.fc11.ppc64, which has "no problem", I built the appended module. It printed this:
> Instruction dump:
> e8090000 f8410028 7f83e378 e9690010 7fa5eb78 7c0903a6 e8490008 4e800421
> <00001010> 00000008 00001013 0000000f 7961626f 6f740000 00101600 00000c00
>                                       ^^^^^^^^ ^^^^ <-- spells "yaboot"
> 00000400 00101100 00000800 7fa3eb78 4bfff24d 60000000 38600000 383f00b0
>                            ^^^^^^^^ <-- goes to correct text again from here
> The magic 44 bytes of bogon at PAGE_OFFSET+4MB effect. We have no idea how long we have been screwed.
> I updated to yaboot-1.3.14-12.fc11.ppc (was f10), ran ybin, no help.
ybin isn't needed on the powerstation iirc. Anyway, that is indeed odd.
We should have Tony take a look at this if possible. Or if David can remember how to do a netboot directly from OF (and skipping yaboot), that would be a good test too.
josh
On Sat, 2009-06-06 at 09:54 -0400, Josh Boyer wrote:
> ybin isn't needed on the powerstation iirc. Anyway, that is indeed odd.
> We should have Tony take a look at this if possible. Or if David can remember how to do a netboot directly from OF (and skipping yaboot), that would be a good test too.
/usr/sbin/wrapper -o zImage /boot/vmlinuz-2.6.29.4-167.fc11.ppc64 \
    -i /boot/initrd-2.6.29.4-167.fc11.ppc64.img
Give resulting zImage to OpenFirmware.
Various versions of OF have different bugs with that (image size, etc.) but I think the PowerStation ought to be fine.
You could also try using kexec -- that should help eliminate yaboot bugs too.
On Sat, 2009-06-06 at 15:29 +0100, David Woodhouse wrote:
> You could also try using kexec -- that should help eliminate yaboot bugs too.
Booting with kexec (after rebuilding it because for some reason we're shipping a ppc32-capable kexec again, gr) shows that the corruption has gone away:
Instruction dump:
fbbd0000 fbbd0008 48224d75 60000000 eb9e8000 7f83e378 48227829 60000000
<7fa3eb78> e89c0020 38bc0018 4beecb21 60000000 7f83e378 4822715d 60000000
38600000 383f0090 e8010010 7c0803a6 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8
I had been seeing it before:
fbbd0000 fbbd0008 48224d75 60000000 eb9e8000 7f83e378 48227829 60000000
<00001010> 00000008 00001013 0000000f 7961626f 6f740000 00101600 00000c00
00000200 00101100 00000810 7c0803a6 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8
I blame yaboot...
On Sat, 2009-06-06 at 16:13 +0100, David Woodhouse wrote:
> On Sat, 2009-06-06 at 15:29 +0100, David Woodhouse wrote:
> > You could also try using kexec -- that should help eliminate yaboot bugs too.
> Booting with kexec (after rebuilding it because for some reason we're shipping a ppc32-capable kexec again, gr)
I needed this too, btw:
--- kexec/arch/ppc64/kexec-elf-rel-ppc64.c.orig	2009-06-06 16:27:10.000000000 +0100
+++ kexec/arch/ppc64/kexec-elf-rel-ppc64.c	2009-06-06 16:08:37.000000000 +0100
@@ -88,6 +88,11 @@ void machine_apply_elf_rel(struct mem_eh
 			| (value & 0x03fffffc);
 		break;
 
+	case R_PPC64_REL32:
+		/* Convert value to relative */
+		*(uint32_t *)location = value - address;
+		break;
+
 	case R_PPC64_ADDR16_LO:
 		*(uint16_t *)location = value & 0xffff;
 		break;
> I blame yaboot...
I note that yaboot doesn't actually do any relocations when it loads the relocatable kernel, while kexec does. Should it?
On Sat, 2009-06-06 at 16:13 +0100, David Woodhouse wrote:
> I blame yaboot...
Fixed in yaboot-1.3.14-13 (thanks to benh for pointing out the problem).
We missed the boat to get that in F11 GA, right?
On Sun, Jun 07, 2009 at 09:28:25AM +0100, David Woodhouse wrote:
> On Sat, 2009-06-06 at 16:13 +0100, David Woodhouse wrote:
> > I blame yaboot...
> Fixed in yaboot-1.3.14-13 (thanks to benh for pointing out the problem).
> We missed the boat to get that in F11 GA, right?
Pretty sure. Jesse is getting on a plane today, and we release Tuesday.
0-day update it seems. Though that isn't going to help the installer any. We might have to recommend yum updating or pre-upgrade for the machines this impacted.
josh
On Sun, 7 Jun 2009, Josh Boyer wrote:
> On Sun, Jun 07, 2009 at 09:28:25AM +0100, David Woodhouse wrote:
> > On Sat, 2009-06-06 at 16:13 +0100, David Woodhouse wrote:
> > > I blame yaboot...
> > Fixed in yaboot-1.3.14-13 (thanks to benh for pointing out the problem).
> > We missed the boat to get that in F11 GA, right?
> Pretty sure. Jesse is getting on a plane today, and we release Tuesday.
> 0-day update it seems. Though that isn't going to help the installer any. We might have to recommend yum updating or pre-upgrade for the machines this impacted.
Or 'netboot' with the zImage, which can be done from the CD.
Precisely where in the release kernel does the 4MiB corruption happen?
On Sun, Jun 07, 2009 at 12:57:06PM +0100, David Woodhouse wrote:
> On Sun, 7 Jun 2009, Josh Boyer wrote:
> > On Sun, Jun 07, 2009 at 09:28:25AM +0100, David Woodhouse wrote:
> > > On Sat, 2009-06-06 at 16:13 +0100, David Woodhouse wrote:
> > > > I blame yaboot...
> > > Fixed in yaboot-1.3.14-13 (thanks to benh for pointing out the problem).
> > > We missed the boat to get that in F11 GA, right?
> > Pretty sure. Jesse is getting on a plane today, and we release Tuesday.
> > 0-day update it seems. Though that isn't going to help the installer any. We might have to recommend yum updating or pre-upgrade for the machines this impacted.
> Or 'netboot' with the zImage, which can be done from the CD.
> Precisely where in the release kernel does the 4MiB corruption happen?
drivers/scsi/scsi_transport_iscsi.c, which is unconditionally hit (via iscsi_tcp) by anaconda.
kernel@lists.fedoraproject.org