I'm having an issue shutting down. Fedora 34, just updated to 35, same problem. (It was around in 33 as well, I've been putting up with it.) I know that there is another list for beta versions, but this issue was there in the current version as well.
Somewhere around 1 out of every 10 shutdowns, the system goes down to "[ OK ] Reached target System Shutdown", but then, "[ *** ] a stop job is running for Cryptography setup for luks-<cycle through LUKS volumes>"
It typically goes for about 1/2 hour if I don't get fed up, and hold the power button down.
I just saw after that 1/2 hour: [Time] Timed out starting System Reboot Forcibly Rebooting: job timed out. audit: type=1334.........
I still had to power it off. On reboot, it seems OK.
I haven't found anything interesting in /var/log/messages regarding it.
Any suggestions to further troubleshooting?
Thanks, --murph
On Wed, 29 Sep 2021 14:17:01 -0400 murph nj murphnj+fedora@gmail.com wrote:
I'm having an issue shutting down.
[snip]
I haven't found anything interesting in /var/log/messages regarding it.
Any suggestions to further troubleshooting?
I think this is an artifact of the shutdown process. It is supposition on my part, but I think it is just like the startup process, so that things are shutdown asynchronously. Thus from time to time, a dependency is not met because something is too slow or too fast, and it leads to a race condition, and the outcome you are seeing. I also occasionally see the same thing (in rawhide, future F36), but for a different process. I think the last one I saw was for something called 'User Manager'. You might also check the journal with journalctl -r and page down to the last shutdown. It is possible systemd placed a message there preferentially. Is there any kind of message about waiting for something? Other than that, it would require digging into the scheduling of systemd during shutdown, looking at dependencies for the process you are waiting on, and tweaking timings through trial and error to see which removes it. Infeasible, I think. If systemd had a synchronous (sequential) startup and shutdown, these could be deterministically tuned out, but it would probably be much slower on both sides.
You could open a bugzilla just in case someone knowledgeable immediately recognizes the problem and fixes it. Or, the next time they are tweaking that area, they might remember it and have a look for anything obvious.
On Wed, Sep 29, 2021 at 2:18 PM murph nj murphnj+fedora@gmail.com wrote:
I'm having an issue shutting down. Fedora 34, just updated to 35, same problem. (It was around in 33 as well, I've been putting up with it.) I know that there is another list for beta versions, but this issue was there in the current version as well.
Somewhere around 1 out of every 10 shutdowns, the system goes down to "[ OK ] Reached target System Shutdown", but then, "[ *** ] a stop job is running for Cryptography setup for luks-<cycle through LUKS volumes>"
It typically goes for about 1/2 hour if I don't get fed up, and hold the power button down.
I just saw after that 1/2 hour: [Time] Timed out starting System Reboot Forcibly Rebooting: job timed out. audit: type=1334.........
I still had to power it off. On reboot, it seems OK.
I haven't found anything interesting in /var/log/messages regarding it.
Any suggestions to further troubleshooting?
Seems likely to me some service is not quitting properly, preventing / from being unmounted, which prevents cryptsetup from closing the dm-crypt device. And as tedious as it is, the trick is to find out what that service is. You might also get a clue by comparing the shutdown sequences between successful and unsuccessful variants. The successful log will list all kinds of services being quit, whereas the unsuccessful one will be missing two or more services. Maybe from that you can infer what the problem is.
Another idea: boot using systemd.log_level=debug to get more information in the journal (a *lot* more information). And also consider adding systemd.debug-shell=1 as well, but note that this is a security risk because it will persistently put a root level shell on tty9, which is what you'll switch to with control-alt-F9 when you get the shutdown hang. And then do:
systemctl list-jobs df lsof <any mount point backed by the hung cryptsetup job>
On 9/30/21 12:41, Chris Murphy wrote:
Seems likely to me some service is not quitting properly, preventing / from being unmounted
If that were the case, there might be information about an exit failure in the log. On the next boot, "journalctl -b -1" might have useful info.
I also wonder if enabling sysrq and using sysrq+'t' would help determine which process is stuck:
https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
On Thu, Sep 30, 2021 at 4:13 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 9/30/21 12:41, Chris Murphy wrote:
Seems likely to me some service is not quitting properly, preventing / from being unmounted
If that were the case, there might be information about an exit failure in the log. On the next boot, "journalctl -b -1" might have useful info.
I also wonder if enabling sysrq and using sysrq+'t' would help determine which process is stuck:
It might but it's also really verbose, so for starters "systemctl list-jobs" is the path of least resistance I suspect. Report suggests the systemd cylon eye, which suggests a job is running and not exiting as you say, so list the jobs and then do a systemctl status on each one of them to find out what they are up to.
On Thu, Sep 30, 2021 at 8:35 PM Chris Murphy lists@colorremedies.com wrote:
On Thu, Sep 30, 2021 at 4:13 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 9/30/21 12:41, Chris Murphy wrote:
Seems likely to me some service is not quitting properly, preventing / from being unmounted
If that were the case, there might be information about an exit failure in the log. On the next boot, "journalctl -b -1" might have useful info.
I also wonder if enabling sysrq and using sysrq+'t' would help determine which process is stuck:
It might but it's also really verbose, so for starters "systemctl list-jobs" is the path of least resistance I suspect. Report suggests the systemd cylon eye, which suggests a job is running and not exiting as you say, so list the jobs and then do a systemctl status on each one of them to find out what they are up to.
Well, I finally enabled the shell, and it (as it is intermittent) hung up
on shutdown.
lsof showed nothing that I could see.
The stop job rotates between four mounted volumes, and the lvm2-monitor.service. (The lvm2.monitor finally stopped, the volumes have not)
systemctl status doesn't seem to display much, but the stop job (and cylon eye) keep overwriting everything, making it really hard to see the output.
Perhaps let me let everyone off the hook though. I was thinking about it, and started digging into the logs. This system has been upgraded up from at least Fedora 30, my dnf logs start in 2019-06. (It may even predate that, not sure)
So, as far as dragging people through debugging seems somewhat futile to me, unless collectively you think that it is worthwhile to pursue this. I'm game if this can help the Fedora community, but the system is working, I can back up, and do a fresh install after a long time of good service after so many upgrades. (I've done all of them, and even switched to the beta track when it got close a few times)
Any thoughts?
On Wed, 1 Dec 2021 21:03:16 -0500 murph nj murphnj+fedora@gmail.com wrote:
So, as far as dragging people through debugging seems somewhat futile to me, unless collectively you think that it is worthwhile to pursue this. I'm game if this can help the Fedora community, but the system is working, I can back up, and do a fresh install after a long time of good service after so many upgrades. (I've done all of them, and even switched to the beta track when it got close a few times)
Any thoughts?
I have no thoughts on investigating or solving your problem, but I think your logic about doing a new install of the latest version to finesse the problem is valid. Everyone who has expertise in working on this will have long ago moved on to the the latest versions of both the OS and the packages. They have no way of investigating your problem directly. And, additionally, all the little intermittent tweaks tend to get lost to memory in the noise of constant change, so their advice if you choose to continue investigating will reflect that.
On 12/1/21 18:03, murph nj wrote:
lsof showed nothing that I could see.
Nothing, or nothing interesting? I'd expect you to at least see the CWD of the debug shell.
The stop job rotates between four mounted volumes, and the lvm2-monitor.service.
In the past, I've seen lvm2-monitor behave badly in reaction to logical volumes (block devices) used for VMs that used LVM internally. Do you have any logical volumes that aren't used for those four mounted volumes?
systemctl status doesn't seem to display much, but the stop job (and cylon eye) keep overwriting everything, making it really hard to see the output.
Several suggestions: Chris suggested that you check "systemctl list-jobs". The contents of /proc/mounts might be interesting. I'd run "lsof /path" where path is the longest mount point path of the volumes still listed in /proc/mounts.
On Thu, Dec 2, 2021 at 2:06 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 12/1/21 18:03, murph nj wrote:
lsof showed nothing that I could see.
Nothing, or nothing interesting? I'd expect you to at least see the CWD of the debug shell.
The error messages were continuously overwriting everything in the shell. If it was a single line, I may not have been able to read it.
The stop job rotates between four mounted volumes, and the lvm2-monitor.service.
In the past, I've seen lvm2-monitor behave badly in reaction to logical volumes (block devices) used for VMs that used LVM internally. Do you have any logical volumes that aren't used for those four mounted volumes?
No LVM on this system. (I don't recall why, I'm not averse to LVM, but this one doesn't have it.)
systemctl status doesn't seem to display much, but the stop job (and cylon eye) keep overwriting everything, making it really hard to see the output.
Several suggestions: Chris suggested that you check "systemctl list-jobs". The contents of /proc/mounts might be interesting. I'd run "lsof /path" where path is the longest mount point path of the volumes still listed in /proc/mounts.
I'll give it a shot next time it hangs up on shutdown. Any ideas on how to make the error messages abate (in the debug shell) so that I can read these things more easily?
Thanks, --murph
On Thu, Dec 2, 2021 at 2:06 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 12/1/21 18:03, murph nj wrote:
lsof showed nothing that I could see.
Nothing, or nothing interesting? I'd expect you to at least see the CWD of the debug shell.
Got another opportunity, redirected lsof /home and lsof /other to files, and there was no output.
The stop job rotates between four mounted volumes, and the lvm2-monitor.service.
In the past, I've seen lvm2-monitor behave badly in reaction to logical volumes (block devices) used for VMs that used LVM internally. Do you have any logical volumes that aren't used for those four mounted volumes?
No LVM, as mentioned earlier
systemctl status doesn't seem to display much, but the stop job (and cylon eye) keep overwriting everything, making it really hard to see the output.
Several suggestions: Chris suggested that you check "systemctl list-jobs". The contents of /proc/mounts might be interesting. I'd run "lsof /path" where path is the longest mount point path of the volumes still listed in /proc/mounts.
I did get output from systemctl-list-jobs (edited lightly)
JOB UNIT TYPE STATE 5788 poweroff.target start waiting 6039 systemd-cryptsetup@luks\x2dshome.service stop running 5790 shutdown.target start waiting 6031 final.target start waiting 6068 systemd-cryptsetup@luks\x2de<snip>8.service stop running 6055 systemd-cryptsetup@luks\x2dsother.service stop running 6013 lvm2-monitor.service stop running 6047 systemd-cryptsetup@luks\x2d2<snip>3.service stop running 6032 umount.target start waiting 5789 systemd-poweroff.service start waiting
10 jobs listed.
On 12/4/21 19:15, murph nj wrote:
Got another opportunity, redirected lsof /home and lsof /other to files, and there was no output.
Interesting... Is there an error if you try to unmount those yourself?
I did get output from systemctl-list-jobs (edited lightly)
JOB UNIT TYPE STATE 6039 systemd-cryptsetup@luks\x2dshome.service stop running 6068 systemd-cryptsetup@luks\x2de<snip>8.service stop running 6055 systemd-cryptsetup@luks\x2dsother.service stop running 6047 systemd-cryptsetup@luks\x2d2<snip>3.service stop running
Home and other make sense given your earlier comment, but what are the other two? Do the number of entries in /etc/crypttab match the number of LUKS partitions? (What's in /proc/mounts during the failed shutdown?)
On Sat, Dec 4, 2021 at 11:33 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 12/4/21 19:15, murph nj wrote:
Got another opportunity, redirected lsof /home and lsof /other to files, and there was no output.
Interesting... Is there an error if you try to unmount those yourself?
Remember, all of the output (and the command line I'm typing on) is immediately erased by the cylon eye, so I'll try (next time it does it) and try to redirect stdout, and stderr to a file, typing blind.
I did get output from systemctl-list-jobs (edited lightly)
JOB UNIT TYPE STATE 6039 systemd-cryptsetup@luks\x2dshome.service stop running 6068 systemd-cryptsetup@luks\x2de<snip>8.service stop running 6055 systemd-cryptsetup@luks\x2dsother.service stop running 6047 systemd-cryptsetup@luks\x2d2<snip>3.service stop running
Home and other make sense given your earlier comment, but what are the other two? Do the number of entries in /etc/crypttab match the number of LUKS partitions? (What's in /proc/mounts during the failed shutdown?)
I think they do line up, I'll do my best to confirm, and get the information from /proc/mounts.
When I try to umount them: /home and /other show as not mounted. / on the other hand, says busy.
There are 4 entries in crypttab, the two listed by name above, presumably /, and I don't know what the other is..... It is only listed by uuid, not sure how to correlate it to a mounted drive.
When I got the output of /proc/mounts, lots of logical entries, and "/" Appeared to be the only real partition mounted. (proc, sysfs, none, and many more were there)
On Sat, Dec 4, 2021 at 11:33 PM Gordon Messmer gordon.messmer@gmail.com wrote:
On 12/4/21 19:15, murph nj wrote:
Got another opportunity, redirected lsof /home and lsof /other to files, and there was no output.
Interesting... Is there an error if you try to unmount those yourself?
I did get output from systemctl-list-jobs (edited lightly)
JOB UNIT TYPE STATE 6039 systemd-cryptsetup@luks\x2dshome.service stop running 6068 systemd-cryptsetup@luks\x2de<snip>8.service stop running 6055 systemd-cryptsetup@luks\x2dsother.service stop running 6047 systemd-cryptsetup@luks\x2d2<snip>3.service stop running
Home and other make sense given your earlier comment, but what are the other two? Do the number of entries in /etc/crypttab match the number of LUKS partitions? (What's in /proc/mounts during the failed shutdown?)
When I try to umount them: /home and /other show as not mounted. / on the other hand, says busy.
There are 4 entries in crypttab, the two listed by name above, presumably /, and I don't know what the other is..... It is only listed by uuid, not sure how to correlate it to a mounted drive.
When I got the output of /proc/mounts, (from the console when it was failing to shut down) lots of logical entries, and "/" Appeared to be the only real partition mounted. (proc, sysfs, none, and many more were there)
users@lists.stg.fedoraproject.org