Hi,
TL;DR: I have problems with resuming from suspend to RAM on my new Ryzen computer. I have only seen this happen if I put root onto a btrfs subvolume, not on ext4. The proprietary nvidia driver seems to be one additional factor, but I have also seen this with the nvidia driver removed. Has anyone experienced something similar?
I have upgraded my PC with a Ryzen 7 3700X, an Asus ROG STRIX x570-E Gaming and a Samsung 970 EVO PLUS NVMe SSD. The existing Fedora 32 system from my old SATA SSD worked flawlessly (including suspend to RAM). For the last few years I have always used ext4 (previously ext3) on monolithic root partitions (no separate /boot or /home, but separate data partitions) on MBR-partitioned SATA disks, booting in legacy BIOS mode. With the new drive I wanted to make the switch to GPT, UEFI boot and btrfs (with an ext4 /boot).
I didn't want to install a new system (with those months of finding programs you have not yet installed) but opted to just copy over my old F32 system. So I used gparted to set up the GPT with a 200MiB EFI System partition, three 500MiB boot partitions (for different distributions or Fedora versions), a 16GiB swap partition and the rest of the drive as btrfs. In the btrfs volume I created a fedora32 subvolume with a nested home subvolume. I then mounted everything (/, /boot, /boot/efi) on my old f30 system and copied over everything from my f32 partition. After bind mounting /sys, /proc and /dev I chrooted into the new copy, adjusted the fstab, installed all EFI-related packages, ran grub2-mkconfig and made sure the kernel paths in /boot/loader/entries were correct. I then switched to the system rescue mode of an f32 netinstall USB drive booted in UEFI mode (to get access to the efivars) to install grub with target x86_64-efi and regenerate the initrds.
After that everything booted up and seemed to work until I tried suspend to RAM. It went to sleep properly, but resuming did not complete. After waking up, the machine just continued to display the last four kernel messages of the suspend action (suspending processes, ..., suspending terminal). It reacted to the emergency sync SysRq (HDD LED blinking), but the other SysRq keys did not seem to work ("u" also provoked a blinking LED sometimes). This happened both from within KDE and from a text terminal with systemctl suspend.
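For readers who want to retrace the copy-and-chroot step, here is a minimal dry-run sketch of the bind mounts and the two regeneration commands. It is not the exact sequence used above: TARGET, GRUB_CFG, DRY_RUN and the helper names are assumptions made for illustration, and the right grub.cfg path depends on the boot mode and Fedora version. With DRY_RUN=1 (the default) it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical sketch of the chroot preparation; nothing here is taken
# verbatim from the original setup.
TARGET="${TARGET:-/mnt/newroot}"             # assumed mount point of the copied root
GRUB_CFG="${GRUB_CFG:-/boot/grub2/grub.cfg}" # assumed path; differs on some UEFI setups
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_chroot() {
    # Make the kernel interfaces of the running system visible in the copy.
    for fs in sys proc dev; do
        run mount --bind "/$fs" "$TARGET/$fs"
    done
    # Regenerate the grub configuration from inside the copy.
    run chroot "$TARGET" grub2-mkconfig -o "$GRUB_CFG"
    # Rebuild the initrds for all installed kernels. In the post this was
    # done later from rescue mode; it is shown here only for completeness.
    run chroot "$TARGET" dracut --regenerate-all --force
}
```

Calling prepare_chroot prints the five commands. Setting DRY_RUN=0 would run them for real, which needs root and a correctly mounted target.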
Log files after reboot only had entries until shortly before suspend (processes suspended, all except the last CPU core disabled, unneeded drives stopped) but nothing from the attempt to resume. I assumed this to be caused by the NVMe SSD and unsuccessfully tried some suggested solutions that have worked for others with suspend problems on NVMe SSDs (disabling acpiphp, disabling d3cold_allowed).
Since I had too many variables, I trashed the content of the new SSD and started anew with an MBR partition table to boot in legacy BIOS mode. I just plain cloned the original f32 partition to the NVMe SSD, adjusted the fstab, updated grub.cfg, recreated the initrds and installed grub to the MBR, and everything worked, including suspend. I then did another copy with btrfs root (and ext4 /boot), this time on MBR with BIOS boot, and it again showed the previous suspend problem. No swap space this time.
I also did a new install of F32 (from Everything Netinstall with Plasma Workspace profile) with btrfs root and ext4 /boot, which suspended correctly at the beginning but failed to resume after I installed the proprietary nvidia driver for my graphics card. Removing the nvidia driver (and updating grub.cfg and the initrds) returned that install to a working state. I then also removed the nvidia driver on the second non-working copy of my old system (checked that "lsmod | grep nvidia" does not show anything), but suspend still did not work. It did not show the kernel messages but just a black screen with a frozen mouse pointer. So the nvidia driver seems to be one way to trigger it, but there apparently are other ways to reach the non-working state.
I have now trashed everything again and settled for GPT, UEFI and root on ext4 (no separate /boot) with /home on a btrfs subvolume as a compromise. This seems to be working fine. As I now have a btrfs /home my problem is also likely not caused by having files open on a btrfs partition.
The problems were with kernels 5.7.9-200.fc32 and 5.7.10-201.fc32. I probably should also have tried an older kernel, but have not yet done so (I might try to get a new non-working test setup tomorrow). Nvidia driver packages were version 440.100 from rpmfusion on the new install and a rebuild of the f33 packages of 450.57 for the existing install.
My hardware:
- AMD Ryzen 7 3700X
- Asus ROG STRIX x570-E Gaming (latest BIOS version 2407)
- Samsung 970 EVO PLUS NVMe SSD
- Geforce GTX960
Tested setups:
- Old Ext4 on MBR, SATA: working
- copy of old Ext4 on MBR, NVMe: working
- copy to BTRFS (Ext4 /boot, with nvidia) on GPT, UEFI, NVMe: not working
- copy to BTRFS (Ext4 /boot, with nvidia) on MBR, BIOS, NVMe: not working
- copy BTRFS (Ext4 /boot; nvidia removed) on MBR, BIOS, NVMe: not working
- new on BTRFS (Ext4 /boot, w/o nvidia) on MBR, BIOS, NVMe: working
- new on BTRFS (Ext4 /boot, with nvidia) on MBR, BIOS, NVMe: not working
- copy on Ext4 (btrfs /home, with nvidia) on MBR, BIOS, NVMe: working
- copy on Ext4 (btrfs /home, with nvidia) on GPT, UEFI, NVMe: working
So this seems to be unrelated to the partition table type and the boot mode. If it is related to NVMe, that is just one factor. I have only observed it with / on BTRFS. On a new install the proprietary nvidia driver is also needed to trigger this, but on my old install it also occurred with the nvidia driver removed.
Things I have not tried yet (might try when I find the time again):
- older kernel version
- ext4 root but with separate boot partition (unlikely cause)
- non-nvidia graphics card (don't have one)
- logging kernel messages on different device using some serial output (there is a way, right?) to see what really is failing
Has anybody else experienced something similar? Is there something I might have missed in the btrfs conversion process? This might become interesting with F33 with lots of new btrfs systems.
Best regards,
Lukas
Hi,
> So this seems to be unrelated to the partition table type and the boot mode. If it is related to NVMe, that is just one factor. I have only observed it with / on BTRFS. On a new install the proprietary nvidia driver is also needed to trigger this, but on my old install it also occurred with the nvidia driver removed.
> Things I have not tried yet (might try when I find the time again):
> - older kernel version
> - ext4 root but with separate boot partition (unlikely cause)
> - non-nvidia graphics card (don't have one)
> - logging kernel messages on different device using some serial output
>   (there is a way, right?) to see what really is failing
I have now done some more debugging and I think I found out what happens: The firmware loading for my Hauppauge WinTV-dualHD USB DVB tuner fails and locks up the system during resume when /usr/lib/firmware resides on a btrfs.
The resume fails if all of the following are true:
1. the root file system is btrfs (not ext4)
2. the DVB tuner USB device is plugged in
3. the nvidia driver or dvb-firmware from rpmfusion is installed
It may also be necessary for root to be on an NVMe drive, but I have not tested that yet.
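To check quickly whether a machine matches these conditions, something like the following sketch could be used. It is only a rough proxy: treating a loaded si2168 module as "tuner plugged in" and a loaded nouveau module as "nvidia driver not in use" are assumptions, and the variable and function names (MOUNTS, SYS_MODULE, FW_DIR, report) are made up for illustration. Only the firmware file name comes from the findings above.

```shell
#!/bin/sh
# Hypothetical helper that reports the state of the suspected conditions.
MOUNTS="${MOUNTS:-/proc/mounts}"          # mount table to inspect
SYS_MODULE="${SYS_MODULE:-/sys/module}"   # loaded modules show up here as directories
FW_DIR="${FW_DIR:-/usr/lib/firmware}"     # firmware directory named in the post

# Print the file system type of /, using the last matching mount entry.
root_fstype() {
    awk '$2 == "/" { fs = $3 } END { print fs }' "$MOUNTS"
}

# Succeed if a module directory with the given name exists.
module_loaded() {
    [ -d "$SYS_MODULE/$1" ]
}

report() {
    echo "root fs: $(root_fstype)"
    for m in si2168 nouveau; do
        if module_loaded "$m"; then
            echo "$m: loaded"
        else
            echo "$m: not loaded"
        fi
    done
    if [ -e "$FW_DIR/dvb-demod-si2168-b40-01.fw" ]; then
        echo "firmware file: present"
    else
        echo "firmware file: missing"
    fi
}
```

Calling report prints four lines. Going by the description above, "root fs: btrfs" together with "si2168: loaded" and "nouveau: not loaded" would be the combination to watch out for.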
Apparently the Si2168-B40 on that device needs a firmware upload to function. During a normal resume, the si2168 kernel module seems to load the binary firmware /usr/lib/firmware/dvb-demod-si2168-b40-01.fw into the device. During initial device initialization (boot or device plug-in) the firmware files are not loaded; that only happens on first actual use.
The third condition boils down to: disk accesses are needed to get or find the firmware file.
If the firmware is already in the RAM cache and no actual disk access is needed, the resume succeeds. If the firmware file is not present, the resume also succeeds, but only if the content of /usr/lib/firmware is cached in RAM. If the directory content is not known, the kernel still tries to load the non-existent file and the resume fails. Installing the nvidia driver is apparently harmful not through the driver itself but because the nouveau driver is then no longer used; nouveau also needs firmware files from /usr/lib/firmware. Loading the nouveau driver apparently lists the content of that directory, so the non-existence of the si2168 firmware files is cached in RAM.
A workaround seems to be to run the following before suspend:
    ls -R /usr/lib/firmware/ > /dev/null
    cat /usr/lib/firmware/dvb* > /dev/null
The first part apparently is only needed if no other driver (like nouveau) has loaded firmware, which lists and caches the content of /usr/lib/firmware. Another option is to actually use the tuner and trigger a firmware upload before suspend. This also puts the file into cache.
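The workaround could be automated with a systemd sleep hook, roughly as sketched below. systemd runs executables in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument. The script name, the FW_DIR variable and the function name are assumptions. Note that nothing guarantees the cached data stays in RAM until the resume, so this is a mitigation rather than a fix.

```shell
#!/bin/sh
# Hypothetical sleep hook, e.g. saved as an executable file named
# /usr/lib/systemd/system-sleep/prime-firmware-cache (name is made up).
FW_DIR="${FW_DIR:-/usr/lib/firmware}"

prime_firmware_cache() {
    # Walk the tree so the directory contents are cached, including the
    # knowledge that a given firmware file does not exist.
    ls -R "$FW_DIR" > /dev/null 2>&1
    # Read the DVB firmware files so their data is in the page cache.
    for f in "$FW_DIR"/dvb*; do
        if [ -f "$f" ]; then
            cat "$f" > /dev/null
        fi
    done
    return 0
}

# systemd passes "pre" before sleeping and "post" after waking up.
case "${1:-}" in
    pre) prime_firmware_cache ;;
esac
```

Run without arguments the script does nothing, so it can be tested by calling it manually with "pre" as the first argument.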
When the nouveau driver is used, and the system is running in text mode, and console_suspend of printk is disabled (echo N > /sys/module/printk/parameters/console_suspend) I can actually see kernel messages of the resume process, even when it is locking up. During a successful resume there are messages related to the firmware load. When resume is unsuccessful, however, there seem to be no messages related to the DVB tuner.
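For repeated test runs, the console_suspend switch can be wrapped in a small helper like the sketch below. The PARAM variable and the function name are assumptions. The sysfs path is the one given above, and writing to it requires root.

```shell
#!/bin/sh
# Hypothetical helper: keep the console active across suspend so that
# resume messages stay visible on a text console.
PARAM="${PARAM:-/sys/module/printk/parameters/console_suspend}"

keep_console_awake() {
    if [ -w "$PARAM" ]; then
        # "N" disables suspending the console, as in the command above.
        echo N > "$PARAM"
    else
        echo "cannot write $PARAM (not root, or parameter missing?)" >&2
        return 1
    fi
}
```

The setting does not survive a reboot. As far as I know, the no_console_suspend kernel command line parameter has the same effect from boot on, but I have not verified that it behaves identically here.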
The problem is present at least on kernels 5.5.18, 5.6.19 and 5.7.10, so this does not seem to be a recent regression.
I had a quick look into the kernel source file drivers/media/dvb-frontends/si2168.c, and the actual firmware reading is done by request_firmware(). When root is on btrfs, calling it during resume might not be as safe as the kernel documentation states.
Any suggestions on where I should report my findings to have those properly looked into? I think I have narrowed the cause down far enough so that somebody with knowledge of the kernel code might have a chance to identify and fix the bug.
Best regards,
Lukas
On 8/1/20 1:51 PM, Lukas Middendorf wrote:
> Any suggestions on where I should report my findings to have those properly looked into? I think I have narrowed the cause down far enough so that somebody with knowledge of the kernel code might have a chance to identify and fix the bug.
I would suggest filing a bug in the kernel bugzilla.
On Sat, Aug 1, 2020 at 3:04 PM Samuel Sieb <samuel@sieb.net> wrote:
> On 8/1/20 1:51 PM, Lukas Middendorf wrote:
>> Any suggestions on where I should report my findings to have those properly looked into? I think I have narrowed the cause down far enough so that somebody with knowledge of the kernel code might have a chance to identify and fix the bug.
> I would suggest filing a bug in the kernel bugzilla.
Sorry, just saw this and made the connection with the email on linux-btrfs@. Really great investigative work by Lukas.
There will soon be more info for Btrfs in Fedora, including bug reporting. The gist is:
- If A vs B testing shows a regression, i.e. ext4 (or XFS) vs Btrfs behavior difference, just like in this example, then you can file it in the Red Hat Bugzilla, classification Fedora, component kernel.
- After submitting the bug, it would be awesome if you can change the Assignee to fedora-kernel-btrfs@fedoraproject.org so the proper folks are notified.
If you're not sure if you've found a bug, or just have questions, I think the usual process of gradual escalation makes sense. Ask here, and if the usual helpful suspects don't have an answer, get my attention on IRC (cmurf) in any of #fedora, and I'll do my best to answer.