f11 ppc64 woes - kernel - Fedora mailing-lists

6 Jun 2009


I reproduced what jwb reported on the powerstation.
Mine is running an F10 + updates userland; only the kernel seems to matter.
The test case is:
    # modprobe iscsi_tcp
    Illegal Instruction
    #
On -16[278], same oops that jwb saw, wrong text appearing at a page boundary.
This kernel:
http://kojipkgs.fedoraproject.org/scratch/roland/task_1396640/kernel-vanilla...
does not exhibit the problem.  That should be all the same buildroot stuff,
and 2.6.29.4 with no extra patches.
OTOH, this kernel:
http://kojipkgs.fedoraproject.org/scratch/roland/task_1396192/kernel-2.6.29....
also does not exhibit the problem.  That is normal -167 with all the same
patches, but built in dist-f10-updates-candidate buildroots.
But contrary to jwb's reports:
On my powerstation 2.6.29.3-159.fc11.ppc64 fails to boot:
Loading ipr moduipr 0001:01:01.0: IOA initialized.
    le
    scsi0 : IBM 572C Storage Adapter
    scsi 0:0:0:0: Direct-Access     IBM-ESXS ST373455SS       BA23 PQ: 0 ANSI: 5
    Oops: Exception in kernel mode, sig: 4 [#1]
    SMP NR_CPUS=128 NUMA Maple
    Modules linked in: ipr(+)
    NIP: c000000000400000 LR: c0000000003ffff8 CTR: c00000000044304c
    REGS: c0000000781a7620 TRAP: 0700   Not tainted  (2.6.29.3-159.fc11.ppc64)
    MSR: 9000000000089032 <EE,ME,IR,DR>  CR: 24000024  XER: 000fffff
    TASK = c000000078193500[70] 'scsi_scan_0' THREAD: c0000000781a4000 CPU: 2
    GPR00: c0000000003ffff8 c0000000781a78a0 c000000000e6c210 c000000000db0390 
    GPR04: 0000000000000058 c00000007be6e000 000000000000000a ffffffffffffffff 
    GPR08: 0000000000000000 c0000000781a4000 c00000007be6e618 0000000000000000 
    GPR12: 0000000044000028 c000000000ea2800 000000000021a344 000000000021a394 
    GPR16: 000000000021a368 0000000000000000 c00000007be6e000 c0000000781a7b50 
    GPR20: c000000078090800 0000000000000000 0000000000000000 c00000007be6e060 
    GPR24: c000000078090828 0000000000000000 0000000000000000 fffffffffffffffa 
    GPR28: c000000078400120 0000000000000000 c000000000e0d710 c0000000781a7810 
    NIP [c000000000400000] .transport_destroy_device+0x3c/0x50
    LR [c0000000003ffff8] .transport_destroy_device+0x34/0x50
    Call Trace:
    [c0000000781a78a0] [c000000000446c9c] .scsi_alloc_sdev+0x220/0x284 (unreliable)
    [c0000000781a7950] [c000000000446f8c] .scsi_probe_and_add_lun+0x168/0xda8
    [c0000000781a7ad0] [c000000000447fb0] .__scsi_scan_target+0x104/0x730
    [c0000000781a7c20] [c000000000448654] .scsi_scan_channel+0x78/0xe8
    [c0000000781a7ce0] [c0000000004489d4] .scsi_scan_host_selected+0x11c/0x1a8
    [c0000000781a7da0] [c000000000448b48] .do_scsi_scan_host+0xe8/0x10c
    [c0000000781a7e40] [c000000000448bb0] .do_scan_async+0x44/0x220
    [c0000000781a7ef0] [c0000000000c57f8] .kthread+0x90/0xe4
    [c0000000781a7f90] [c00000000002f730] .kernel_thread+0x54/0x70
    Instruction dump:
    fbe1fff8 f821ff71 7c3f0b78 ebc2d0f0 f87f0070 60000000 60000000 e87f0070 
    e89e8000 4bfff8cd 60000000 383f0090 <00001010> 00000008 00001013 0000000f 
    ---[ end trace 998c8eb2fd0a3b41 ]---
    usb 2-1.4: new full speed USB device using ohci_hcd and address 3
    usb 2-1.4: New USB device found, idVendor=05ac, idProduct=1003
It prints some more usb probe msgs, but nothing else.
I can then reboot it with Ctrl-Alt-Del on the USB keyboard.
This is obviously a variant of the same problem.  
It's losing on clobbered instructions at a page boundary.
Same for -158.
Same for -157.
Same for -155.
Same for -154.
Man but these bastards boot slow.
Same for -152.
-142 does not do that, nor exhibit the problem in "modprobe iscsi_tcp".
buildroot diffs -142 vs -152:
    @@ -23,8 +23,8 @@
     device-mapper-1.02.31-4.fc11.ppc64
     device-mapper-libs-1.02.31-4.fc11.ppc64
     diffutils-2.8.1-23.fc11.ppc64
    -e2fsprogs-1.41.4-8.fc11.ppc64
    -e2fsprogs-libs-1.41.4-8.fc11.ppc64
    +e2fsprogs-1.41.4-9.fc11.ppc64
    +e2fsprogs-libs-1.41.4-9.fc11.ppc64
     elfutils-0.140-2.fc11.ppc64
     elfutils-libelf-0.140-2.fc11.ppc64
     elfutils-libs-0.140-2.fc11.ppc64
    @@ -89,7 +89,7 @@
     nss-3.12.3-4.fc11.ppc64
     nss-softokn-freebl-3.12.3-4.fc11.ppc64
     openldap-2.4.15-3.fc11.ppc64
    -openssl-0.9.8k-1.fc11.ppc64
    +openssl-0.9.8k-4.fc11.ppc64
     pam-1.0.91-6.fc11.ppc64
     patch-2.5.4-38.fc11.ppc64
     pcre-7.8-2.fc11.ppc64
    @@ -102,7 +102,7 @@
     pkgconfig-0.23-8.fc11.ppc64
     policycoreutils-2.0.62-12.2.fc11.ppc64
     popt-1.13-5.fc11.ppc64
    -ppl-0.10.1-1.fc11.ppc64
    +ppl-0.10.2-2.fc11.ppc64
     procps-3.2.7-27.fc11.ppc64
     psmisc-22.6-9.fc11.ppc64
     readline-5.2-14.fc11.ppc64
ppl is a library used by gcc.  I eyeballed 0.10.1->0.10.2 changes
and I doubt it's involved (AFAIK no actual code changes there).
http://koji.fedoraproject.org/koji/taskinfo?taskID=1396771
is a scratch build of -142 in the current buildroots.
I'll go to sleep before it finishes, try it tomorrow.
I also tried:
koji build --nowait --scratch --arch-override=ppc64 --repo-id=72872 dist-f11 'cvs://cvs.fedoraproject.org/cvs/pkgs?rpms/kernel/F-11#kernel-2_6_29_3-152_fc11'
but:
    BuildError: Bad repo: 72872 (DELETED)
That would have been -152 but built in the buildroot -142 was built in.
I guess we'd need a temporary tag or something to recover the right repo
to do that build in koji.
lbr () 
{ 
    koji rpminfo "$1" | awk '$1 == "Buildroot:" { print $2 }' | xargs -n1 koji list-buildroot
}
diff -u <(lbr kernel-2.6.29.3-142.fc11.ppc64) <(koji list-buildroot 470988)
for the buildroot diffs from -142 to the scratch -142 (current buildroot).
Now you know what I know and we still know nothing.
Oh, and note the two variant crashes in different kernels are in different
routines in different builds, but always at PC 0xc000000000400000,
and always clobbered the next few words with:
    00001010 00000008 00001013 0000000f
The magic PAGE_OFFSET+4MB effect.  So, youse gots to wonder, and...
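As a rough cross-check of the addresses in the -159 oops above, this user-space C sketch redoes the arithmetic. It assumes PAGE_OFFSET is 0xc000000000000000 on this ppc64 kernel (which is what calling the faulting address "PAGE_OFFSET+4MB" implies). The helper names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ppc64 PAGE_OFFSET; inferred from "PAGE_OFFSET+4MB", not read from a kernel header. */
#define ASSUMED_PAGE_OFFSET 0xc000000000000000ULL

/* Address of the first clobbered word: PAGE_OFFSET plus 4 MiB. */
static uint64_t clobber_addr(void)
{
    return ASSUMED_PAGE_OFFSET + (4ULL << 20);
}

/* Start of a function, given an address inside it and the symbol+offset from the oops. */
static uint64_t func_start(uint64_t addr, uint64_t off)
{
    return addr - off;
}

/* True if the function [start, start+size) runs across the given boundary. */
static int straddles(uint64_t start, uint64_t size, uint64_t boundary)
{
    return start < boundary && boundary < start + size;
}

/* Re-derive the relationships printed in the oops. */
void check_oops_addresses(void)
{
    uint64_t nip = 0xc000000000400000ULL;   /* .transport_destroy_device+0x3c/0x50 */
    uint64_t lr  = 0xc0000000003ffff8ULL;   /* .transport_destroy_device+0x34/0x50 */

    assert(nip == clobber_addr());
    /* NIP and LR must agree on where the function starts. */
    assert(func_start(nip, 0x3c) == func_start(lr, 0x34));
    /* The 0x50-byte function spans the 4MB mark. */
    assert(straddles(func_start(nip, 0x3c), 0x50, clobber_addr()));
}
```

If the assertions hold, .transport_destroy_device in -159 starts at 0xc0000000003fffc4 and its 0x50 bytes run across the 4MB mark, which fits the faulting NIP landing exactly on that boundary.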
On 2.6.29.3-142.fc11.ppc64, which has "no problem", I built the appended module.
It printed this:
Instruction dump:
e8090000 f8410028 7f83e378 e9690010 7fa5eb78 7c0903a6 e8490008 4e800421 
<00001010> 00000008 00001013 0000000f 7961626f 6f740000 00101600 00000c00 
                                      ^^^^^^^^ ^^^^ <-- spells "yaboot"
00000400 00101100 00000800 7fa3eb78 4bfff24d 60000000 38600000 383f00b0 
                           ^^^^^^^^ <-- goes to correct text again from here
The magic 44 bytes of bogon at PAGE_OFFSET+4MB effect.
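The byte count and the "yaboot" reading can be checked with a small user-space C sketch. It copies the 11 bogus words from the dump above and unpacks them big-endian, as they would sit in ppc64 memory. The array and function names are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 11 bogus words from the dump, starting at 0xc000000000400000. */
static const uint32_t bogon[] = {
    0x00001010, 0x00000008, 0x00001013, 0x0000000f,
    0x7961626f, 0x6f740000, 0x00101600, 0x00000c00,
    0x00000400, 0x00101100, 0x00000800,
};

/* Unpack 32-bit words into bytes in big-endian order; returns the byte count. */
static size_t words_to_bytes(const uint32_t *w, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        out[4 * i + 0] = (w[i] >> 24) & 0xff;
        out[4 * i + 1] = (w[i] >> 16) & 0xff;
        out[4 * i + 2] = (w[i] >> 8) & 0xff;
        out[4 * i + 3] = w[i] & 0xff;
    }
    return 4 * n;
}

/* Confirm the two observations: 44 bytes in total, "yaboot" at byte offset 16. */
void check_bogon(void)
{
    unsigned char buf[sizeof(bogon)];
    size_t len = words_to_bytes(bogon, sizeof(bogon) / sizeof(bogon[0]), buf);

    assert(len == 44);
    assert(memcmp(buf + 16, "yaboot", 6) == 0);
}
```

Words five and six (0x7961626f, 0x6f740000) are the ASCII bytes 'y' 'a' 'b' 'o' 'o' 't' followed by two NULs, so the string sits 16 bytes past the 4MB mark.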
We have no idea how long we have been screwed.
I updated to yaboot-1.3.14-12.fc11.ppc (was f10), ran ybin, no help.
So long and thanks for all the geese,
Roland
======
#include <linux/module.h>
#include <asm/ptrace.h>
#include <asm/uaccess.h>

MODULE_DESCRIPTION("fmh");
MODULE_LICENSE("GPL");

/* Number of 32-bit words to dump; can be overridden at load time. */
int ninsn = 24;
module_param(ninsn, int, 0);

static void __exit exit_fmh(void)
{
}

static int __init init_fmh(void)
{
    int i;
    /* Start eight instructions before PAGE_OFFSET+4MB. */
    unsigned long pc = 0xc000000000400000 - 8*4;

    printk(KERN_ERR "Instruction dump:");
    for (i = 0; i < ninsn; i++) {
        int instr;

        if (!(i % 8))
            printk("\n");
        /* We use __get_user here *only* to avoid an OOPS on a
         * bad address because the pc *should* only be a
         * kernel address.
         */
        if (__get_user(instr, (unsigned int __user *)pc)) {
            printk("XXXXXXXX ");
        } else {
            if (0xc000000000400000 == pc)
                printk("<%08x> ", instr);
            else
                printk("%08x ", instr);
        }
        pc += sizeof(int);
    }
    printk("\n");
    /* Return an error so the module never stays loaded; it only prints the dump. */
    return -EAGAIN;
}
module_init(init_fmh);
module_exit(exit_fmh);