resume from suspend to RAM not working properly with / on btrfs

26 Jul 2020


Hi,
TL;DR: I have problems resuming from suspend to RAM on my new Ryzen
computer. I have only seen this happen with root on a btrfs
subvolume, not on ext4. The proprietary nvidia driver seems to be an
additional factor, but I have also seen the problem with the nvidia
driver removed. Has anyone experienced something similar?
I have upgraded my PC with a Ryzen 7 3700X, Asus ROG STRIX x570-E Gaming 
and a Samsung 970 EVO PLUS NVMe SSD.
The existing Fedora 32 system on my old SATA SSD worked flawlessly,
including suspend to RAM.
For the last several years I have always used ext4 (previously ext3) on
monolithic root partitions (no separate /boot or /home, but separate
data partitions) on MBR-partitioned SATA disks, booting in legacy BIOS mode.
With the new drive I wanted to make the switch to GPT, UEFI boot and 
btrfs (ext4 /boot). I didn't want to install a new system (with those 
months of finding programs you have not yet installed) but opted to just 
copy over my old F32 system.
So I used gparted to set up the GPT and a 200MiB EFI System partition, 
three 500MiB boot partitions (for different distributions or Fedora 
versions), a 16GiB swap partition and the rest of the drive as btrfs. In 
the btrfs volume I created a fedora32 subvolume with a nested home 
subvolume. I then mounted everything (/, /boot, /boot/efi) on my old f30 
system and copied over everything from my f32 partition. After bind 
mounting /sys, /proc and /dev I chrooted into the new copy, adjusted the
fstab, installed all EFI-related packages, ran grub2-mkconfig and made
sure the kernel paths in /boot/loader/entries were correct. I then
switched to the system rescue mode of an F32 netinstall USB drive booted
in UEFI mode (to get access to the efivars) to install grub with target
x86_64-efi and regenerate the initrds.
After that everything booted up and seemed to work until I tried suspend 
to RAM. It went to sleep properly, but resuming did not complete. After 
waking up it just continued to display the last four kernel messages of 
the suspend action (suspending processes, ..., suspending terminal). It 
reacted to emergency sync sysrq (HDD LED blinking) but the other sysrq 
keys did not seem to work ("u" also provoked a blinking LED sometimes). 
This happened from within KDE as well as from a text terminal with
systemctl suspend. After a reboot, the log files only had entries up to
shortly before the suspend (processes suspended, all except the last CPU
core disabled, unneeded drives stopped) but nothing from the attempt to
resume.
I assumed this to be caused by the NVMe-SSD and unsuccessfully tried 
some suggested solutions that have worked for others with suspend 
problems with NVMe-SSDs (disabling acpiphp, disabling d3cold_allowed). 
Since I had too many variables, I trashed the content of the new SSD and
started anew with an MBR partition table to boot in legacy BIOS mode. I
simply cloned the original f32 partition to the NVMe SSD, adjusted
the fstab, updated grub.cfg, recreated the initrds and installed grub to
the MBR. Everything worked, including suspend.
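For anyone who wants to check the d3cold_allowed setting mentioned above before changing it, here is a minimal read-only sketch. It assumes the usual sysfs layout under /sys/bus/pci/devices and identifies NVMe controllers by the PCI class code 0x010802; the function name nvme_d3cold_status is made up for this example. "Disabling" then means writing 0 to the same attribute as root, which this sketch deliberately does not do.

```python
from pathlib import Path

# PCI class code for "mass storage / non-volatile memory / NVMe interface"
NVME_CLASS = "0x010802"


def nvme_d3cold_status(sysfs_pci=Path("/sys/bus/pci/devices")):
    """Return {pci_address: True/False} for every NVMe controller found.

    True means d3cold_allowed currently reads "1". Nothing is modified.
    """
    root = Path(sysfs_pci)
    if not root.is_dir():
        # Not a Linux system, or sysfs is not mounted here
        return {}
    status = {}
    for dev in sorted(root.iterdir()):
        class_file = dev / "class"
        attr = dev / "d3cold_allowed"
        if not class_file.is_file() or not attr.is_file():
            continue
        if class_file.read_text().strip() != NVME_CLASS:
            continue
        status[dev.name] = attr.read_text().strip() == "1"
    return status


if __name__ == "__main__":
    for address, allowed in nvme_d3cold_status().items():
        print(f"{address}: d3cold_allowed={'1' if allowed else '0'}")
```

The directory argument is only there so the function can be tried against a fake tree; on a real system the default path should be the one to use.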
I then made another copy with btrfs root (and ext4 /boot), this
time on MBR with BIOS boot, and it again showed the previous suspend
problem. No swap space this time.
I also did a new install of F32 (from Everything Netinstall with Plasma 
Workspace profile) with btrfs root and ext4 /boot, which suspended 
correctly at the beginning but failed to resume after I installed the 
proprietary nvidia driver for my graphics card. Removing the nvidia 
driver (and updating grub.cfg and the initrds) returned that install to 
a working state.
I then also removed the nvidia driver from the second non-working copy of
my old system (and checked that "lsmod | grep nvidia" does not show
anything), but suspend still did not work. This time it did not show the
kernel messages but just a black screen with a frozen mouse pointer. So
the nvidia driver seems to be one way to trigger it, but there apparently
are other ways to reach the non-working state.
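The "lsmod | grep nvidia" check can also be scripted. Below is a minimal sketch that reads the same information from /proc/modules (which is what lsmod formats); the function name loaded_modules_matching is made up for this example. Unlike grep, it only matches module names that start with the given prefix and ignores the "used by" column.

```python
from pathlib import Path


def loaded_modules_matching(prefix, modules_text):
    """Return the names of loaded modules starting with `prefix`.

    `modules_text` is the content of /proc/modules, where the first
    whitespace-separated field of each line is the module name.
    """
    names = [line.split()[0] for line in modules_text.splitlines() if line.strip()]
    return [name for name in names if name.startswith(prefix)]


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.is_file():
        found = loaded_modules_matching("nvidia", proc_modules.read_text())
        print("nvidia modules loaded:", ", ".join(found) if found else "none")
```

An empty result corresponds to "lsmod | grep nvidia" printing nothing, at least as far as module names are concerned.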
I have now trashed everything again and settled for GPT, UEFI and root 
on ext4 (no separate /boot) with /home on a btrfs subvolume as a 
compromise. This seems to be working fine. As I now have a btrfs /home 
my problem is also likely not caused by having files open on a btrfs 
partition.
The problems occurred with kernels 5.7.9-200.fc32 and 5.7.10-201.fc32. I
should likely also have tried an older kernel, but have not yet done so
(might try to get a new non-working test setup tomorrow).
Nvidia driver packages were version 440.100 from rpmfusion on the new 
install and a rebuild of the f33 packages of 450.57 for the existing 
install.
My hardware:
- AMD Ryzen 7 3700X
- Asus ROG STRIX x570-E Gaming (latest BIOS version 2407)
- Samsung 970 EVO PLUS NVMe SSD
- GeForce GTX 960
Tested setups:
Old Ext4 on MBR, SATA: working
copy of old Ext4 on MBR, NVMe: working
copy to BTRFS (Ext4 /boot, with nvidia) on GPT, UEFI, NVMe: not working
copy to BTRFS (Ext4 /boot, with nvidia) on MBR, BIOS, NVMe: not working
copy to BTRFS (Ext4 /boot, nvidia removed) on MBR, BIOS, NVMe: not working
new on BTRFS (Ext4 /boot, w/o nvidia) on MBR, BIOS, NVMe: working
new on BTRFS (Ext4 /boot, with nvidia) on MBR, BIOS, NVMe: not working
copy on Ext4 (btrfs /home, with nvidia) on MBR, BIOS, NVMe: working
copy on Ext4 (btrfs /home, with nvidia) on GPT, UEFI, NVMe: working
So this seems to be unrelated to the partition table type and the boot
mode. If it is related to NVMe, that is just one factor. I have only
observed it with / on BTRFS. On a new install the proprietary nvidia
driver is also needed to trigger it, but on my old install it also
occurred with the nvidia driver removed.
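The reasoning over the list of tested setups can be double-checked mechanically. The sketch below encodes the nine rows and reports which factor values all failing setups have in common. The names Setup, shared_by_failures and working_with are made up for this example. Two things are assumptions rather than part of the list: the nvidia state of the first two rows is not stated there and is recorded as None, and their boot mode is taken from the description of the old system (legacy BIOS).

```python
from typing import NamedTuple, Optional


class Setup(NamedTuple):
    origin: str              # "old", "copy" or "new"
    root_fs: str             # filesystem holding /
    nvidia: Optional[bool]   # proprietary driver installed? None = not stated
    table: str               # partition table type
    boot: str                # firmware boot mode
    disk: str                # drive type
    resumes: bool            # True if resume from suspend worked


SETUPS = [
    Setup("old",  "ext4",  None,  "MBR", "BIOS", "SATA", True),
    Setup("copy", "ext4",  None,  "MBR", "BIOS", "NVMe", True),
    Setup("copy", "btrfs", True,  "GPT", "UEFI", "NVMe", False),
    Setup("copy", "btrfs", True,  "MBR", "BIOS", "NVMe", False),
    Setup("copy", "btrfs", False, "MBR", "BIOS", "NVMe", False),
    Setup("new",  "btrfs", False, "MBR", "BIOS", "NVMe", True),
    Setup("new",  "btrfs", True,  "MBR", "BIOS", "NVMe", False),
    Setup("copy", "ext4",  True,  "MBR", "BIOS", "NVMe", True),
    Setup("copy", "ext4",  True,  "GPT", "UEFI", "NVMe", True),
]

FACTORS = ("origin", "root_fs", "nvidia", "table", "boot", "disk")


def shared_by_failures(setups):
    """Return the factor values that every failing setup has in common."""
    failing = [s for s in setups if not s.resumes]
    shared = {}
    for factor in FACTORS:
        values = {getattr(s, factor) for s in failing}
        if len(values) == 1:
            shared[factor] = values.pop()
    return shared


def working_with(setups, factor, value):
    """Return working setups that have `value` for `factor` (counterexamples)."""
    return [s for s in setups if s.resumes and getattr(s, factor) == value]


if __name__ == "__main__":
    print("common to all failures:", shared_by_failures(SETUPS))
    for s in working_with(SETUPS, "root_fs", "btrfs"):
        print("btrfs root that still resumes:", s)
```

Only btrfs root and the NVMe drive are common to all failing setups, and both also appear in working setups, so neither is sufficient on its own. With nine data points this is a consistency check on the list, not a proof of cause.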
Things I have not tried yet (might try when I find the time again):
- older kernel version
- ext4 root but with separate boot partition (unlikely cause)
- non-nvidia graphics card (don't have one)
- logging kernel messages to a different device using some serial output
   (there is a way, right?) to see what really is failing
Has anybody else experienced something similar? Is there something I 
might have missed in the btrfs conversion process?
This might become interesting with F33, which will likely bring lots of
new btrfs systems.
Best regards,
Lukas