F25 System Wide Change: KillUserProcesses=yes by default


Jan Kurik

7 Jul 2016

7:13 a.m.

= Proposed System Wide Change: KillUserProcesses=yes by default =
https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):
* Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

== Detailed Description ==
Since the introduction of systemd-logind a few years back, when a session is created, systemd hooks into the PAM session creation step to move the process that starts the session into a separate cgroup. This means that processes which are started as part of the session can be reliably tracked, even if they detach from the terminal and daemonize. When a user session terminates, various processes started as part of the user session (initially) remain alive. When the session is terminated, remaining processes receive a HUP signal (*), which can be and often is ignored.
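To illustrate the tracking described above: on Linux a process's cgroup membership can be read from /proc/<pid>/cgroup. The following is a minimal Python sketch, not part of the proposal. The helper names are made up for the example, and the sample path in the usage note is only illustrative; it assumes the usual "ID:controllers:path" line format.

```python
import re


def cgroup_paths(text):
    """Return the cgroup path from each 'ID:controllers:path' line."""
    paths = []
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) == 3:
            paths.append(parts[2])
    return paths


def session_scope(text):
    """Return the first 'session-*.scope' component found, or None."""
    for path in cgroup_paths(text):
        match = re.search(r"session-[^/]+\.scope", path)
        if match:
            return match.group(0)
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as f:
            print(session_scope(f.read()))
    except OSError:
        print("no /proc/self/cgroup available on this system")
```

For a hypothetical line such as "1:name=systemd:/user.slice/user-1000.slice/session-3.scope", session_scope() returns the scope name; a process running as a service (no session scope in its path) yields None.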

Under the proposed setting of KillUserProcesses=yes, systemd will forcibly terminate (using SIGTERM and then SIGKILL) all processes which are part of the session scope (the cgroup created for the login session) when the user logs out. In order for a process to avoid being killed it has to be part of a different systemd unit. For user processes this can be achieved in two primary ways: by starting the unit as a service (e.g. 'systemd-run --user /usr/bin/foo', or creating a dedicated user service unit), or by telling systemd to create a new scope unit to encompass a specific process (e.g. 'systemd-run --user --scope /usr/bin/foo', or making a dbus call to create a scope unit directly). This step can be integrated directly into programs when this makes sense for their primary use case, e.g. screen.
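As a rough illustration of the two approaches above, here is a small Python sketch that only builds the corresponding systemd-run command lines; it does not run anything. The helper name survive_logout_cmd and the /usr/bin/foo path are placeholders. Whether the resulting unit actually survives logout may also depend on local policy, so treat this as a sketch under those assumptions.

```python
import shlex


def survive_logout_cmd(program, args=(), scope=False):
    """Build (but do not run) a systemd-run command line for a user unit."""
    cmd = ["systemd-run", "--user"]
    if scope:
        # Ask for a new scope unit instead of a transient service unit.
        cmd.append("--scope")
    cmd.append(program)
    cmd.extend(args)
    return cmd


if __name__ == "__main__":
    # Service variant, then scope variant, printed as shell-quoted strings.
    print(shlex.join(survive_logout_cmd("/usr/bin/foo")))
    print(shlex.join(survive_logout_cmd("/usr/bin/foo", scope=True)))
```

Returning a list rather than a string means the result can be handed to subprocess.run() without going through a shell.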

(*) Whether SIGHUP is sent depends on a few factors: bash sends it to its children, tcsh does not, and the kernel also sends SIGHUP to processes which have a terminal open.
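To show why SIGHUP alone is not a reliable cleanup mechanism, here is a minimal Python sketch of a process opting out of it, which is roughly what nohup arranges before running a command. The function name is made up for the example.

```python
import signal


def ignore_hangup():
    """Ignore SIGHUP and return the previous handler so it can be restored."""
    return signal.signal(signal.SIGHUP, signal.SIG_IGN)


if __name__ == "__main__":
    previous = ignore_hangup()
    # From here on a hangup no longer terminates this process.
    # SIGTERM can still be handled, but SIGKILL cannot be caught or ignored.
    print("SIGHUP ignored; previous handler was", previous)
```

A process like this keeps running after the terminal goes away, which is why the proposal relies on SIGTERM followed by SIGKILL for the session scope instead.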

== Scope ==
* Proposal owners:
- work upstream to clarify what is the best way for programs to mark themselves to survive logout
- update the documentation with more explanations and examples, as we learn what people find confusing in the current scheme of things
- evaluate a "permissive" mode for KillUserProcesses, to make it easier to debug processes which stay around after a session terminates
- remove the compile-time override in the systemd package
- work with upstream authors and Fedora maintainers of programs like screen and tmux to implement the ability to automatically start them in a way that survives a user session, and if the system policy does not allow that, to warn the user.

* Other developers:
- cooperate on the last item from previous point
- identify additional services which need to adapt to the changed default.

Different services might merit different handling here: some might be updated to start through the non-session-specific dbus instance, some might need documentation changes, while others possibly should be handled like tmux and screen.

* Release engineering: N/A

* List of deliverables: N/A (not a System Wide Change)

* Policies and guidelines: - a Fedora Magazine article or similar to publicize the change would be nice

* Trademark approval: N/A (not needed for this Change)

-- Jan Kuřík Platform & Fedora Program Manager Red Hat Czech s.r.o., Purkynova 99/71, 612 45 Brno, Czech Republic


Garrett Holmstrom

7 Jul

1:46 p.m.

On 2016-07-07 05:13, Jan Kurik wrote:

...

== Scope ==

Proposal owners:

- work upstream to clarify what is the best way for programs to mark themselves to survive logout
- update the documentation with more explanations and examples, as we learn what people find confusing in the current scheme of things
- evaluate a "permissive" mode for KillUserProcesses, to make it easier to debug processes which stay around after a session terminates
- remove the compile-time override in the systemd package
- work with upstream authors and Fedora maintainers of programs like screen and tmux to implement the ability to automatically start them in a way that survives a user session, and if the system policy does not allow that, to warn the user.

Other developers:

- cooperate on the last item from previous point
- identify additional services which need to adapt to the changed default.

Different services might merit different handling here: some might be updated to start through the non-session-specific dbus instance, some might need documentation changes, while others possibly should be handled like tmux and screen.

Release engineering: N/A

List of deliverables: N/A (not a System Wide Change)

But this is a system-wide change. Is the intention to fill out this list as people learn what needs to be changed?

-- Garrett Holmstrom

Jan Kurik

1:51 p.m.

On Thu, Jul 7, 2016 at 8:46 PM, Garrett Holmstrom gholms@fedoraproject.org wrote:

...

On 2016-07-07 05:13, Jan Kurik wrote:

...

List of deliverables: N/A (not a System Wide Change)

But this is a system-wide change. Is the intention to fill out this list as people learn what needs to be changed?

That is my fault as I overlooked this. It is fixed now on the wiki.

Regards, Jan

...

-- Garrett Holmstrom -- devel mailing list devel@lists.fedoraproject.org https://lists.fedoraproject.org/admin/lists/devel@lists.fedoraproject.org

-- Jan Kuřík Platform & Fedora Program Manager Red Hat Czech s.r.o., Purkynova 99/71, 612 45 Brno, Czech Republic

Zbigniew Jędrzejewski-Szmek

8 Jul

9:36 a.m.

On Thu, Jul 07, 2016 at 08:51:38PM +0200, Jan Kurik wrote:

...

On Thu, Jul 7, 2016 at 8:46 PM, Garrett Holmstrom gholms@fedoraproject.org wrote:

...
On 2016-07-07 05:13, Jan Kurik wrote:

...

List of deliverables: N/A (not a System Wide Change)

But this is a system-wide change. Is the intention to fill out this list as people learn what needs to be changed?

That is my fault as I overlooked this. It is fixed now on the wiki.

I think the list of things to be changed is in the "Scope" section. "Deliverables" sounds like new "things" that would be created, e.g. an installation image or whatever.

Zbyszek

Nico Kadel-Garcia

9 Jul

6:32 a.m.

On Thu, Jul 7, 2016 at 8:13 AM, Jan Kurik jkurik@redhat.com wrote:

...

= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

We already discussed this idea on this mailing list. It's a *horrible* idea. It breaks screen, nohup processes and all backgrounded tasks, especially if the user has lost their remote SSH session or closed their console session. As I understand the "feature", it produces *no log whatsoever* of having done so. "We'll kill stuff without telling anyone" is like pulling the fuses of someone else's house to save electricity without warning them. It's not reasonable behavior.

I've plenty of colleagues who write lengthy compilation tasks or MySQL queries and log out to let them run in the background. They use nohup, they use "screen" sessions, and they even use NX sessions. So do I, with lengthy compilation tasks taking 20 hours on my own VM or machine, and so does this recent Perl script I saw that waits 10 minutes at a time to check for a flag file before running a task, and is supposed to always stay resident.

Simply killing all of those because the logged in users lost their direct connection would get most admins fired, long before killing idle processes would earn them the political capital to save their jobs.

For a larger environment, it still shouldn't be killing the tasks automatically. That's what scheduled nightly reboots, or nightly audits and autokills with user email notifications are for.

Adam Williamson

11:03 a.m.

On Sat, 2016-07-09 at 07:32 -0400, Nico Kadel-Garcia wrote:

...

On Thu, Jul 7, 2016 at 8:13 AM, Jan Kurik jkurik@redhat.com wrote:

...
= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

We already discussed this idea on this mailing list.

Well, yeah, and a lot of people said "this should be a Change". So it was proposed as one...

-- Adam Williamson Fedora QA Community Monkey IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net http://www.happyassassin.net

Zbigniew Jędrzejewski-Szmek

3:46 p.m.

On Sat, Jul 09, 2016 at 07:32:01AM -0400, Nico Kadel-Garcia wrote:

...

On Thu, Jul 7, 2016 at 8:13 AM, Jan Kurik jkurik@redhat.com wrote:

...
= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

We already discussed this idea on this mailing list. It's a *horrible* idea. It breaks screen, nohup processes and all backgrounded tasks,

Right, the next paragraph that you helpfully snipped, talks about changing screen to automatically register itself with systemd to avoid being killed. So let's discuss the change as proposed, with the assumption that we modify common run-stuff-in-the-background-on-purpose-style programs so that they continue to work as expected.

[...]

...

For a larger environment, it still shouldn't be killing the tasks automatically. That's what scheduled nightly reboots, or nightly audits and autokills with user email notifications are for.

That sounds like a much worse solution in every regard — because the issue of having to mark processes to be exempt from killing is still present, but the process to get rid of unwanted processes is asynchronous, heavyweight, nonstandard, and requires a lot of admin engagement. But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

Zbyszek

Chris Murphy

3:56 p.m.

On Sat, Jul 9, 2016 at 2:46 PM, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:

...

On Sat, Jul 09, 2016 at 07:32:01AM -0400, Nico Kadel-Garcia wrote:

...
On Thu, Jul 7, 2016 at 8:13 AM, Jan Kurik jkurik@redhat.com wrote:

...
= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

We already discussed this idea on this mailing list. It's a *horrible* idea. It breaks screen, nohup processes and all backgrounded tasks,

Right, the next paragraph that you helpfully snipped, talks about changing screen to automatically register itself with systemd to avoid being killed. So let's discuss the change as proposed, with the assumption that we modify common run-stuff-in-the-background-on-purpose-style programs so that they continue to work as expected.

Well I mentioned btrfs balance and scrub, and those seem sufficiently common for Btrfs users, who are themselves not anywhere in the majority, that it concerns me how many programs out there are just going to break. I remain unconvinced the scope of what will be affected by the change is understood.

...

[...]

...
For a larger environment, it still shouldn't be killing the tasks automatically. That's what scheduled nightly reboots, or nightly audits and autokills with user email notifications are for.

That sounds like a much worse solution in every regard — because the issue of having to mark processes to be exempt from killing is still present, but the process to get rid of unwanted processes is asynchronous, heavyweight, nonstandard, and requires a lot of admin engagement. But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

I think this needs to be rethought. The options right now are, modify an as yet unknown quantity of background programs so they aren't killed on user logout; vs logout/restart/shutdown likely hanging for 90 seconds. It seems the work around would be to modify screen and tmux, and then run all such background tasks in either screen or tmux. But, that's kinda, wow... bit of a hammer.

-- Chris Murphy

Ben Rosser

4:09 p.m.

On Sat, Jul 9, 2016 at 4:56 PM, Chris Murphy lists@colorremedies.com wrote:

...

I think this needs to be rethought. The options right now are, modify an as yet unknown quantity of background programs so they aren't killed on user logout; vs logout/restart/shutdown likely hanging for 90 seconds. It seems the work around would be to modify screen and tmux, and then run all such background tasks in either screen or tmux. But, that's kinda, wow... bit of a hammer.

A thought occurred to me: would it be possible to instead implement a whitelist of *binaries* that are allowed to linger, rather than going around patching everything? So for example rather than having to modify the codebase of screen, we have a (sysadmin-modifiable) whitelist that says /usr/bin/screen is allowed to linger? Perhaps this would be something shipped by the screen package, so /usr/bin/screen is only whitelisted if the package providing it is installed.

Just an idea that I probably haven't fully thought through yet... it may have even been something mentioned on the last thread. This just seems like it might be a better approach than trying to patch an arbitrary number of programs, though?

Ben Rosser

Alex Thomas

5:20 p.m.

+1 to the idea that we really do not know how many programs will be affected by this change. We do know that the tmux folks have shot down making any changes to accommodate systemd. As they value cross-platform compatibility, this is understandable.

As it looks from my vantage point, the choice is either carry a patch to revert this change in systemd, or accept the load of carrying an unknown number of patches to allow other software to accommodate this change.

My suggestion is to plan on reverting the systemd change to KillUserProcesses=no in F25, and reevaluate for F26, when we have a better understanding of just how many programs have to be rewritten/patched/forked to accommodate this new paradigm.

Dan Book

9:41 p.m.

On Sat, Jul 9, 2016 at 6:20 PM, Alex Thomas karlthane@gmail.com wrote:

...

As it looks from my vantage point, the choice is either carry a patch to revert this change in systemd, or accept the load of carrying an unknown number of patches to allow other software to accommodate this change.

My suggestion is to plan on reverting the systemd change to KillUserProcesses=no in F25, and reevaluate for F26, when we have a better understanding of just how many programs have to be rewritten/patched/forked to accommodate this new paradigm.

As I understand it, this option has been available for a while, the change in question is just to make it default. In which case the choice is between letting it become default, or just setting the default back to off in Fedora.

Nico Kadel-Garcia

6:27 p.m.

On Sat, Jul 9, 2016 at 5:09 PM, Ben Rosser rosser.bjr@gmail.com wrote:

...

On Sat, Jul 9, 2016 at 4:56 PM, Chris Murphy lists@colorremedies.com wrote:

...
I think this needs to be rethought. The options right now are, modify an as yet unknown quantity of background programs so they aren't killed on user logout; vs logout/restart/shutdown likely hanging for 90 seconds. It seems the work around would be to modify screen and tmux, and then run all such background tasks in either screen or tmux. But, that's kinda, wow... bit of a hammer.

A thought occurred to me: would it be possible to instead implement a whitelist of *binaries* that are allowed to linger, rather than going around patching everything? So for example rather than having to modify the codebase of screen, we have a (sysadmin-modifiable) whitelist that says /usr/bin/screen is allowed to linger? Perhaps this would be something shipped by the screen package, so /usr/bin/screen is only whitelisted if the package providing it is installed.

This is pretty useless if systemd does no logging of having killed the process. That's the difference between managing system resources, and putting every backgrounded task on "double secret probation". It's also pretty useless for newly written shell scripts written in any language.

Ben Rosser

7:15 p.m.

On Sat, Jul 9, 2016 at 7:27 PM, Nico Kadel-Garcia nkadel@gmail.com wrote:

...

On Sat, Jul 9, 2016 at 5:09 PM, Ben Rosser rosser.bjr@gmail.com wrote:

...
On Sat, Jul 9, 2016 at 4:56 PM, Chris Murphy lists@colorremedies.com wrote:

...
I think this needs to be rethought. The options right now are, modify an as yet unknown quantity of background programs so they aren't killed on user logout; vs logout/restart/shutdown likely hanging for 90 seconds. It seems the work around would be to modify screen and tmux, and then run all such background tasks in either screen or tmux. But, that's kinda, wow... bit of a hammer.

A thought occurred to me: would it be possible to instead implement a whitelist of *binaries* that are allowed to linger, rather than going around patching everything? So for example rather than having to modify the codebase of screen, we have a (sysadmin-modifiable) whitelist that says /usr/bin/screen is allowed to linger? Perhaps this would be something shipped by the screen package, so /usr/bin/screen is only whitelisted if the package providing it is installed.

This is pretty useless if systemd does no logging of having killed the process. That's the difference between managing system resources, and putting every backgrounded task on "double secret probation". It's also pretty useless for newly written shell scripts written in any language.

Well, the idea was that binaries shipped by Fedora that we *know* need to be whitelisted could have that information be part of the package that ships them, while admins could add whatever scripts they write themselves to a separate whitelist (that's what I meant by "sysadmin-modifiable"). But you're right, since systemd doesn't log what processes it kills there would be no way to implement such a thing at the moment.

Oh well.

Ben Rosser

Lennart Poettering

12 Jul

5:15 a.m.

On Sat, 09.07.16 17:09, Ben Rosser (rosser.bjr@gmail.com) wrote:

...

On Sat, Jul 9, 2016 at 4:56 PM, Chris Murphy lists@colorremedies.com wrote:

...
I think this needs to be rethought. The options right now are, modify an as yet unknown quantity of background programs so they aren't killed on user logout; vs logout/restart/shutdown likely hanging for 90 seconds. It seems the work around would be to modify screen and tmux, and then run all such background tasks in either screen or tmux. But, that's kinda, wow... bit of a hammer.

A thought occurred to me: would it be possible to instead implement a whitelist of *binaries* that are allowed to linger, rather than going around patching everything? So for example rather than having to modify the codebase of screen, we have a (sysadmin-modifiable) whitelist that says /usr/bin/screen is allowed to linger? Perhaps this would be something shipped by the screen package, so /usr/bin/screen is only whitelisted if the package providing it is installed.

That's hardly useful, as "screen" alone is useless as it's just a frontend to other programs (such as a shell that is run inside the "screen" instance), and if we kill those, then "screen" doesn't need to be around either...

Lennart

-- Lennart Poettering, Red Hat

Przemek Klosowski

1:34 p.m.

On 07/12/2016 06:15 AM, Lennart Poettering wrote:

...

That's hardly useful, as "screen" alone is useless as it's just a frontend to other programs (such as a shell that is run inside the "screen" instance), and if we kill those, then "screen" doesn't need to be around either...

Right---the entire process trees were started by the user for some specific purpose, and this mechanism can't just arbitrarily kill parts of that tree, so, as you point out, the children of the 'whitelisted' processes would have to inherit the immunity.

This shows why it's a difficult problem and also that we may be trying to discuss and solve it on the wrong level. The goal is to kill processes that have no business persisting, while leaving the useful ones---but how do we determine what should persist? We're trying to do some heuristics here, and I am not sure if they can be good enough.

Perhaps we should be looking at a different level, seeing the situation in terms of a desired function/objective rather than looking at individual processes; or having a different activation sequence ('run normally/ephemerally' vs 'run persistently'); or looking at the process behavior (kill everything that sits in select()). Then again, the behavior should depend on the device: different on a handheld, desktop and server.

Zbigniew Jędrzejewski-Szmek

13 Jul

2:56 a.m.

On Tue, Jul 12, 2016 at 02:34:04PM -0400, Przemek Klosowski wrote:

...

On 07/12/2016 06:15 AM, Lennart Poettering wrote:

...
That's hardly useful, as "screen" alone is useless as it's just a frontend to other programs (such as a shell that is run inside the "screen" instance), and if we kill those, then "screen" doesn't need to be around either...

Right---the entire process trees were started by the user for some specific purpose, and this mechanism can't just arbitrarily kill parts of that tree, so, as you point out, the children of the 'whitelisted' processes would have to inherit the immunity.

This shows why it's a difficult problem and also that we may be trying to discuss and solve it on the wrong level. The goal is to kill processes that have no business persisting, while leaving the useful ones---but how do we determine what should persist? We're trying to do some heuristics here, and I am not sure if they can be good enough.

Perhaps we should be looking at a different level, seeing the situation in terms of a desired function/objective rather than looking at individual processes; or having a different activation sequence ('run normally/ephemerally' vs 'run persistently'); or looking at the process behavior (kill everything that sits in select()). Then again, the behavior should depend on the device: different on a handheld, desktop and server.

I have the feeling that with this the discussion has gone full circle: after all, setting K.U.P=yes and requiring an explicit systemd scope to stay around is a way to state the intent of being a permanent process. An explicit request is more robust than any heuristic.

A big implementation detail is whether screen/tmux/etc should do it themselves. In principle this isn't required, and those programs can e.g. be invoked using systemd-run explicitly. Nevertheless, I think that from the usability perspective it is much better if such programs do this on their own. Of course from the systemd side we should provide a simple and easy mechanism that can be used with minimal fuss. That is my intent, and I think that the knee-jerk reaction on the tmux bugtracker was completely premature. If we have such functionality in screen/tmux/etc, expressing the intent should be a matter of a simple switch.

A lot of people seem to think that some mechanism to state such intent would be OK. We probably cannot make it fully automatic, but we should be able to cover most cases where people want things to stay around after logout. There always will be other cases which would require changes to how people call some programs, but hopefully the advantages in the common case will outweigh this disruption.

Zbyszek

Nico Kadel-Garcia

9 Jul

6:40 p.m.

On Sat, Jul 9, 2016 at 4:46 PM, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:

...

On Sat, Jul 09, 2016 at 07:32:01AM -0400, Nico Kadel-Garcia wrote:

...
On Thu, Jul 7, 2016 at 8:13 AM, Jan Kurik jkurik@redhat.com wrote:

...
= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

Set the default policy to terminate processes in session scope when the user logs out. Specifically, systemd-logind's KillUserProcesses setting, which currently is set to "no" to override the upstream default, will be removed to follow the upstream default of "yes".

We already discussed this idea on this mailing list. It's a *horrible* idea. It breaks screen, nohup processes and all backgrounded tasks,

Right, the next paragraph that you helpfully snipped, talks about changing screen to automatically register itself with systemd to avoid being killed. So let's discuss the change as proposed, with the assumption that we modify common run-stuff-in-the-background-on-purpose-style programs so that they continue to work as expected.

[...]

...
For a larger environment, it still shouldn't be killing the tasks automatically. That's what scheduled nightly reboots, or nightly audits and autokills with user email notifications are for.

That sounds like a much worse solution in every regard — because the issue of having to mark processes to be exempt from killing is still present, but the process to get rid of unwanted processes is asynchronous, heavyweight, nonstandard, and requires a lot of admin engagement. But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

They're solutions for environments that really need to disconnect and scrub away dangling users. Scheduled nightly reboots, and even reinstalls of systems left unused for 15 minutes, have been used for network appliances deployed in public kiosks, student clusters, and many teaching laboratories for decades.

The nightly cron jobs that first report, and later slap, processes of disconnected users are ones I've used for lab systems. My favorite dangling processes to kill have been lengthy MySQL and PostgreSQL queries. I *do not* want to just kill those silently in systemd; that's the sort of "security feature" that costs people their jobs for updating to the latest version of an operating system and discovering something broke normal operations.

...

But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

Please don't burn the cycles of admins who have better work to do by breaking the expectations and experience of their multi user environment by introducing what acts a lot like malware.

Stephen John Smoogen

6:48 p.m.

On 9 July 2016 at 19:40, Nico Kadel-Garcia nkadel@gmail.com wrote:

...

On Sat, Jul 9, 2016 at 4:46 PM, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:

...
But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

Please don't burn the cycles of admins who have better work to do by breaking the expectations and experience of their multi user environment by introducing what acts a lot like malware.

The change is proposed. It will not get decided in this mailing list discussion no matter how many emails everyone puts one way or another. Bring it up with FESCO and get it decided there. If the goal of the OS is to break things quickly and often and move on then this change will be tried and if it doesn't work in F25 then it will be changed to something else afterwards. If the goal of the OS is to be more grognard friendly then the change will be worked on in a way that makes it easier to deal with before being implemented (or not).

In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

-- Stephen J Smoogen.

Chris Adams

8:04 p.m.

Once upon a time, Stephen John Smoogen smooge@gmail.com said:

...

The change is proposed.

As I understand it, a change proposal should have some concrete plans, not just a "fix some stuff (we don't know what or how yet)". Isn't there supposed to be at least some outline of what's involved?

-- Chris Adams linux@cmadams.net

Nico Kadel-Garcia

8:20 p.m.

On Sat, Jul 9, 2016 at 7:48 PM, Stephen John Smoogen smooge@gmail.com wrote:

...

On 9 July 2016 at 19:40, Nico Kadel-Garcia nkadel@gmail.com wrote:

...
On Sat, Jul 9, 2016 at 4:46 PM, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:

...
But if you have this kind of setup in place, then simply set KillUserProcesses=no and carry on.

Please don't burn the cycles of admins who have better work to do by breaking the expectations and experience of their multi user environment by introducing what acts a lot like malware.

The change is proposed. It will not get decided in this mailing list discussion no matter how many emails everyone puts one way or another. Bring it up with FESCO and get it decided there. If the goal of the OS is to break things quickly and often and move on, then this change will be tried, and if it doesn't work in F25 then it will be changed to something else afterwards. If the goal of the OS is to be more grognard friendly then the change will be worked on in a way that makes it easier to deal with before being implemented (or not).

In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

It looks to me like the critical change to even consider activating this dangerous policy is to *log* the killing of userland processes. Date, euid, guid, and pid are a minimum: the name of the process would be even better, and the contents of the process invocation command line would be even more useful.

Can systemd even gracefully poll for that information at the time of killing these processes? Or would systemd developers feel a need to re-invent "ps" from scratch to report this?

Lennart Poettering

12 Jul

5:33 a.m.

On Sat, 09.07.16 21:20, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...

...
In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

It looks to me like the critical change to even consider activating this dangerous policy is to *log* the killing of userland processes. Date, euid, guid, and pid are a minimum: the name of the process would be even better, and the contents of the process invocation command line would be even more useful.

Can systemd even gracefully poll for that information at the time of killing these processes? Or would systemd developers feel a need to re-invent "ps" from scratch to report this?

I figure it would be OK to add code to systemd that logs about all processes we kill with SIGKILL and all processes we kill after a "scope" unit is "abandoned".

(Regarding the terms used above: In systemd "scope" units are a concept how groups of processes not started by PID 1 are maintained, very similar to a "service" unit, the only difference being that "services" are forked off by PID 1 itself, while "scopes" are started by other code. Login sessions are maintained in "scopes" as it is not systemd that starts their processes but getty/gdm/... And "abandoning" a "scope" is what happens when the process that created the "scope" goes away before the "scope" itself goes away. This is what happens to the login "scopes" as soon as gdm/getty/... consider the session having ended.)

I think logging about all processes we send signals to (i.e. SIGTERM) would be too much, as this pretty typically happens all the time, for example when a service is terminated. Logging about SIGKILL and abandoned scope processes is different however, as in that case the processes conceptually are "left over", as the clean shutdown logic (which is SIGTERM, or the scope's owner shutting it down properly) apparently didn't work.
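To make the scope/service distinction concrete: the Change page names 'systemd-run --user --scope' and 'systemd-run --user' as the two ways to move a process out of the login session's scope. A minimal shell sketch follows. The function names are made up for illustration, and it assumes a running systemd user instance.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical; the systemd-run
# invocations are the ones quoted in the Change proposal.

# Run a command in a new scope unit of its own, so that it is no longer
# part of the login session's scope.
run_in_own_scope() {
    systemd-run --user --scope "$@"
}

# Run a command as a transient user service, forked off by the systemd
# user instance rather than by the login shell.
run_as_user_service() {
    systemd-run --user "$@"
}

# Usage (not executed here):
#   run_in_own_scope /usr/bin/foo
#   run_as_user_service /usr/bin/foo
#   loginctl session-status    # shows the session scope and its processes
```

Whether such a unit outlives the last logout also depends on whether the user's systemd instance keeps running (lingering), which this thread does not go into.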

Lennart

-- Lennart Poettering, Red Hat

Chris Murphy

8:56 a.m.

I have KillUserProcesses=yes set in Fedora 24 for some time now. I'm noticing that I still often have 90 second delays if I restart or shutdown, more than half the time. If I log out, I always get fast log outs, and more often than not a restart/shutdown from the login window is also fast. So there's something different about directly choosing restart/shutdown inside the user login session, versus logging out first then restarting/shutting down. I have no idea how to collect more information on why I'm experiencing this.

--- Chris Murphy

Garry Williams

13 Jul

10:11 p.m.

On Tuesday, July 12, 2016 7:56:41 AM EDT Chris Murphy wrote:

...

I have KillUserProcesses=yes set in Fedora 24 for some time now. I'm noticing that I still often have 90 second delays if I restart or shutdown, more than half the time.

Yup. Me too.

[snip]

...

I have no idea how to collect more information on why I'm experiencing this.

Yup. Me too.

-- Garry T. Williams

Tomasz Torcz

14 Jul

5:17 a.m.

On Wed, Jul 13, 2016 at 11:11:12PM -0400, Garry Williams wrote:

...

On Tuesday, July 12, 2016 7:56:41 AM EDT Chris Murphy wrote:

...
I have KillUserProcesses=yes set in Fedora 24 for some time now. I'm noticing that I still often have 90 second delays if I restart or shutdown, more than half the time.

Yup. Me too.

...
I have no idea how to collect more information on why I'm experiencing this.

Yup. Me too.

Have you tried https://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 ?

-- Tomasz Torcz, xmpp: zdzichubg@chrome.pl
"There exists no separation between gods and men: one blends softly casual into the other."

Chris Murphy

11:06 a.m.

On Thu, Jul 14, 2016 at 4:17 AM, Tomasz Torcz tomek@pipebreaker.pl wrote:

...

On Wed, Jul 13, 2016 at 11:11:12PM -0400, Garry Williams wrote:

...
On Tuesday, July 12, 2016 7:56:41 AM EDT Chris Murphy wrote:

...
I have KillUserProcesses=yes set in Fedora 24 for some time now. I'm noticing that I still often have 90 second delays if I restart or shutdown, more than half the time.

Yup. Me too.

...
I have no idea how to collect more information on why I'm experiencing this.

Yup. Me too.

Have you tried https://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 ?

Yes I have and there is only a 90 second gap in the journal where nothing is even recorded. What I know is when there is this hang, it's due to a stop job waiting on the user session. I don't know why a direct restart while still logged in appears to always result in a hang; while log out immediately works, and subsequent restart from GDM immediately works.

The hang within gnome-shell on a restart or shutdown request is actually a brutal hang. I have no keyboard control, the entire UI is locked up, I can't get to a console. This is a clean installation of Fedora 24, it's not an upgrade.

-- Chris Murphy

Chris Murphy

11:08 a.m.

On Thu, Jul 14, 2016 at 10:06 AM, Chris Murphy lists@colorremedies.com wrote:

...

On Thu, Jul 14, 2016 at 4:17 AM, Tomasz Torcz tomek@pipebreaker.pl wrote:

...
On Wed, Jul 13, 2016 at 11:11:12PM -0400, Garry Williams wrote:

...
On Tuesday, July 12, 2016 7:56:41 AM EDT Chris Murphy wrote:

...
I have KillUserProcesses=yes set in Fedora 24 for some time now. I'm noticing that I still often have 90 second delays if I restart or shutdown, more than half the time.

Yup. Me too.

...
I have no idea how to collect more information on why I'm experiencing this.

Yup. Me too.

Have you tried https://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1 ?

Yes I have and there is only a 90 second gap in the journal where nothing is even recorded. What I know is when there is this hang, it's due to a stop job waiting on the user session. I don't know why a direct restart while still logged in appears to always result in a hang; while log out immediately works, and subsequent restart from GDM immediately works.

The hang within gnome-shell on a restart or shutdown request is actually a brutal hang. I have no keyboard control, the entire UI is locked up, I can't get to a console. This is a clean installation of Fedora 24, it's not an upgrade.

Ergo, from my perspective, the feature of KillUserProcesses=yes doesn't even really completely work at present. It's not solving the problem it's intended to solve, unless I first log out and then restart, which is sorta ick. Not quite pointless, but fairly pointless.

Chris Murphy

Nico Kadel-Garcia

15 Jul

7:55 a.m.

On Tue, Jul 12, 2016 at 6:33 AM, Lennart Poettering mzerqung@0pointer.de wrote:

...

On Sat, 09.07.16 21:20, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...
...
In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

It looks to me like the critical change to even consider activating this dangerous policy is to *log* the killing of userland processes. Date, euid, guid, and pid are a minimum: the name of the process would be even better, and the contents of the process invocation command line would be even more useful.

Can systemd even gracefully poll for that information at the time of killing these processes? Or would systemd developers feel a need to re-invent "ps" from scratch to report this?

I figure it would be OK to add code to systemd that logs about all processes we kill with SIGKILL and all processes we kill after a "scope" unit is "abandoned".

I'm glad you've commented on the thread. I admit that I was personally surprised to find that such a feature had been activated without logging.

Would it be reasonable or feasible to activate a "WARNING" level for KillUserProcesses, similar to that used by SELinux? For an admin considering this feature, it could be invaluable to generate a day or week of logs about which processes *would* have been killed. I'm particularly thinking of some hand-run backup tools used by former colleagues, tools that used all manner of MySQL, PostgreSQL, rsync, dump, tar, and scattered backup tools run manually as opportunities occurred.

...

(Regarding the terms used above: In systemd "scope" units are a concept how groups of processes not started by PID 1 are maintained, very similar to a "service" unit, the only difference being that "services" are forked off by PID 1 itself, while "scopes" are started by other code. Login sessions are maintained in "scopes" as it is not systemd that starts their processes but getty/gdm/... And "abandoning" a "scope" is what happens when the process that created the "scope" goes away before the "scope" itself goes away. This is what happens to the login "scopes" as soon as gdm/getty/... consider the session having ended.)

I think logging about all processes we send signals to (i.e. SIGTERM) would be too much, as this pretty typically happens all the time, for example when a service is terminated. Logging about SIGKILL and

From your explanation, I think you're correct. I'll note that reporting "SIGTERM" operations might be useful as an admin-selectable debugging option; I don't have a good sense of how much it would spew into the logs. Might it be useful as a debugging option? Do you need or want a feature request for that?

...

abandoned scope processes is different however, as in that case the processes conceptually are "left over", as the clean shutdown logic (which is SIGTERM, or the scope's owner shutting it down properly) apparently didn't work.

Please note that my personal concern is processes for which logging out or losing a login connection *should not* shut down the process. Whitelisting them seems infeasible, and modifying them all to work well with KillUserProcesses quickly becomes a herculean task. Just thinking of my work in the last few years, they include "dd", "rsync", "tar", "mysql" and its related commands, "psql" and its related commands, "gzip" and all its variants, "xz" and all its variants, "bzip2" and all its variants, "mock", "koji", and "make".

Last: I'm afraid the list also includes the wrapper "nohup", which many of us use to log long-running tasks. It's especially useful when we don't want to incur the overhead of using "screen" or "tmux", and of leaving those dangling sessions. And let's be honest: as soon as "nohup" is effectively whitelisted, the game is pretty much over. The most system-abusive processes, exactly those for which KillUserProcesses is most effective, can typically be wrapped with nohup.

...

Lennart

-- Lennart Poettering, Red Hat

Lennart Poettering

11:34 a.m.

On Fri, 15.07.16 08:55, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...

On Tue, Jul 12, 2016 at 6:33 AM, Lennart Poettering mzerqung@0pointer.de wrote:

...
On Sat, 09.07.16 21:20, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...
...
In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

It looks to me like the critical change to even consider activating this dangerous policy is to *log* the killing of userland processes. Date, euid, guid, and pid are a minimum: the name of the process would be even better, and the contents of the process invocation command line would be even more useful.

Can systemd even gracefully poll for that information at the time of killing these processes? Or would systemd developers feel a need to re-invent "ps" from scratch to report this?

I figure it would be OK to add code to systemd that logs about all processes we kill with SIGKILL and all processes we kill after a "scope" unit is "abandoned".

I'm glad you've commented on the thread. I admit that I was personally surprised to find that such a feature had been activated without logging.

Would it be reasonable or feasible to activate a "WARNING" level for KillUserProcesses, similar to that used by SELinux? For an admin considering this feature, it could be invaluable to generate a day or week of logs about which processes *would* have been killed. I'm particularly thinking of some hand-run backup tools used by former colleagues, tools that used all manner of MySQL, PostgreSQL, rsync, dump, tar, and scattered backup tools run manually as opportunities occurred.

We can meet in the middle and make this LOG_NOTICE. That's not the usual LOG_INFO, but also not the higher LOG_WARNING.

Lennart

-- Lennart Poettering, Red Hat

Nico Kadel-Garcia

16 Jul

4:14 p.m.

On Fri, Jul 15, 2016 at 12:34 PM, Lennart Poettering mzerqung@0pointer.de wrote:

...

On Fri, 15.07.16 08:55, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...
On Tue, Jul 12, 2016 at 6:33 AM, Lennart Poettering mzerqung@0pointer.de wrote:

...
On Sat, 09.07.16 21:20, Nico Kadel-Garcia (nkadel@gmail.com) wrote:

...
...
In either case it will be up to FESCO to decide and set guidelines on implementation and for us grognards to either deal with the change or go find an OS we can be happier in.

It looks to me like the critical change to even consider activating this dangerous policy is to *log* the killing of userland processes. Date, euid, guid, and pid are a minimum: the name of the process would be even better, and the contents of the process invocation command line would be even more useful.

Can systemd even gracefully poll for that information at the time of killing these processes? Or would systemd developers feel a need to re-invent "ps" from scratch to report this?

I figure it would be OK to add code to systemd that logs about all processes we kill with SIGKILL and all processes we kill after a "scope" unit is "abandoned".

I'm glad you've commented on the thread. I admit that I was personally surprised to find that such a feature had been activated without logging.

Would it be reasonable or feasible to activate a "WARNING" level for KillUserProcesses, similar to that used by SELinux? For an admin considering this feature, it could be invaluable to generate a day or week of logs about which processes *would* have been killed. I'm particularly thinking of some hand-run backup tools used by former colleagues, tools that used all manner of MySQL, PostgreSQL, rsync, dump, tar, and scattered backup tools run manually as opportunities occurred.

We can meet in the middle and make this LOG_NOTICE. That's not the usual LOG_INFO, but also not the higher LOG_WARNING.

Lennart

Just to verify: I assume you mean that the killing of these processes would normally emit a "LOG_NOTICE" message. This makes me happier because it produces *some* kind of log. It's pretty scary to just kill user processes with no log whatsoever: this is a positive step.

Also to be clear: What about having an alternative "WARNING" setting for KillUserProcesses, one that would generate a log message but not actually kill processes? That would help developers and admins to run test systems for some reasonable time, such as a week or so, to audit for processes that people leave dangling and for which the processes need modification for compatibility with KillUserProcesses.

It could also be invaluable for admins faced with software, like "tmux", for which the upstream authors seem unwilling to provide compatibility with KillUserProcesses, but for which an admin may nevertheless want to collect a daily or weekly report of dangling sessions. It's been pesky to write the necessary cron jobs to generate such reports. I'd be delighted to have these dangling processes listed, without actually killing them automatically.

Björn Persson

5:17 p.m.

Nico Kadel-Garcia wrote:

...

On Fri, Jul 15, 2016 at 12:34 PM, Lennart Poettering mzerqung@0pointer.de wrote:

...
We can meet in the middle and make this LOG_NOTICE. That's not the usual LOG_INFO, but also not the higher LOG_WARNING.

Just to verify: I assume you mean that the killing of these processes would normally emit a "LOG_NOTICE" message.

Do I understand correctly that KillUserProcesses is meant to be a safety net to catch processes that should have terminated when the user logged out, but failed to do so? In that case each killed process should be logged on the error level, not notice, because it's not a normal event. It's a defect in the program that it failed to terminate cleanly, and to draw attention to the defect and get it fixed, it should be logged as an error.

Or is KillUserProcesses meant to become the normal way of ending a user session? Are programs supposed to start relying on receiving a TERM signal from SystemD to tell them that the user has logged out? In that case it's a perfectly normal and uninteresting event, and the processes shouldn't be logged by default on any level higher than debug – but if a process fails to terminate and gets a KILL, then it should be logged on the error level, because – again – it's a defect in the program that it failed to terminate on the TERM signal.

...

Also to be clear: What about having an alternative "WARNING" setting for KillUserProcesses, one that would generate a log message but not actually kill processes?

That will certainly be needed during the transition, and when that setting is in effect, then it seems reasonable to log on the notice level.

Björn Persson

Zbigniew Jędrzejewski-Szmek

19 Jul

3:28 p.m.

On Sun, Jul 17, 2016 at 12:17:49AM +0200, Björn Persson wrote:

...

Nico Kadel-Garcia wrote:

...
On Fri, Jul 15, 2016 at 12:34 PM, Lennart Poettering mzerqung@0pointer.de wrote:

...
We can meet in the middle and make this LOG_NOTICE. That's not the usual LOG_INFO, but also not the higher LOG_WARNING.

Just to verify: I assume you mean that the killing of these processes would normally emit a "LOG_NOTICE" message.

Do I understand correctly that KillUserProcesses is meant to be a safety net to catch processes that should have terminated when the user logged out, but failed to do so?

Yes, we usually expect user processes to exit on their own. But it's quite likely that this kind of mistake will happen quite often. Also, some users might simply take advantage of the fact that this safety net is present and leave processes around. Either way, it's not very clear-cut, and logging at error level would probably be quite annoying.

But adjusting the log level is a very simple change, so we can start at notice level (which by default will end up in logs, but will not be too obnoxious), and adjust up (if in the common case we get no output) or down (if in the common case we get too many logs).

Zbyszek

Chris Murphy

29 Jul

12:31 p.m.

Can anyone explain why the feature works for Logout, but doesn't work for Restart or Shutdown when initiated in the logged in shell session?

KillUserProcesses true does not kill user gdm session on restart, restart hangs 1m30s https://bugzilla.redhat.com/show_bug.cgi?id=1341837

I'm not running anything exotic. But from my perspective the feature isn't working as designed, and in the meantime is going to burden other processes with changes that are in the interim pointless if this turns out to not be an effective way of solving the problem.

The other question is whether sudo should be modified to put privilege escalated processes into a different session so that they aren't killed? A privilege escalated program doesn't really seem to me it ought to be found in my user session anyway.

Chris Murphy

12:44 p.m.

On Fri, Jul 29, 2016 at 11:31 AM, Chris Murphy lists@colorremedies.com wrote:

...

Can anyone explain why the feature works for Logout, but doesn't work for Restart or Shutdown when initiated in the logged in shell session?

KillUserProcesses true does not kill user gdm session on restart, restart hangs 1m30s https://bugzilla.redhat.com/show_bug.cgi?id=1341837

The ticket is closed but I've supplied my opinion anyway. https://fedorahosted.org/fesco/ticket/1600#comment:19

Maybe someone can beat me to a test involving pvmove from one disk to another, initiated in GNOME Terminal, and logging out before it completes. I'd love to know what state it puts the system in...

There must be dozens of examples, not yet discovered or provided, where the task is expected to take hours or days, and the admin properly logs out of the shell before that task is finished.

What about remote tasks? root is exempt from KillUserProcesses, and I typically disable root on all systems, especially servers with ssh enabled all the time. So I login as chris, and use sudo to run those tasks. Are they going to get clobbered when I exit from that remote session? I suspect they would, so maybe this is not an appropriate default for server and cloud products?
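For hosts where that is a concern, the opt-out mentioned earlier in the thread (KillUserProcesses=no) can be kept in a logind drop-in. A sketch, assuming the standard /etc/systemd/logind.conf.d drop-in location; the function name and the file name are made up for this example:

```shell
#!/bin/sh
# Sketch: write a logind drop-in that opts a host out of the new default.
# The file name 10-keep-user-processes.conf is hypothetical.

write_logind_dropin() {
    # $1: target directory; defaults to the usual logind drop-in location.
    dir=${1:-/etc/systemd/logind.conf.d}
    mkdir -p "$dir" || return 1
    cat > "$dir/10-keep-user-processes.conf" <<'EOF'
[Login]
KillUserProcesses=no
EOF
}

# As root on a real system (not executed here):
#   write_logind_dropin
#   systemctl restart systemd-logind   # or reboot, so logind rereads its config
```

Keeping the setting in a drop-in rather than editing logind.conf itself should make it easier to see which hosts deviate from the distribution default.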

-- Chris Murphy

Chris Adams

3:36 p.m.

Once upon a time, Chris Murphy lists@colorremedies.com said:

...

Maybe someone can beat me to a test involving pvmove from one disk to another, initiated in GNOME Terminal, and logging out before it completes. I'd love to know what state it puts the system in...

The actual work of pvmove is not done by the command you run; that sets it up and it is run in the background (by a kernel thread). All the command you run does then is periodically check and print a percentage done.

-- Chris Adams linux@cmadams.net

Chris Murphy

30 Jul

11:37 a.m.

On Fri, Jul 29, 2016 at 2:36 PM, Chris Adams linux@cmadams.net wrote:

...

Once upon a time, Chris Murphy lists@colorremedies.com said:

...
Maybe someone can beat me to a test involving pvmove from one disk to another, initiated in GNOME Terminal, and logging out before it completes. I'd love to know what state it puts the system in...

The actual work of pvmove is not done by the command you run; that sets it up and it is run in the background (by a kernel thread). All the command you run does then is periodically check and print a percentage done.

It's the same with btrfs balance and scrub. It may be the operation completes by kernel code, but with user space detached from the kill, the status/statistics are lost.

-- Chris Murphy

Chris Adams

11:44 a.m.

Once upon a time, Chris Murphy lists@colorremedies.com said:

...

On Fri, Jul 29, 2016 at 2:36 PM, Chris Adams linux@cmadams.net wrote:

...
The actual work of pvmove is not done by the command you run; that sets it up and it is run in the background (by a kernel thread). All the command you run does then is periodically check and print a percentage done.

It's the same with btrfs balance and scrub. It may be the operation completes by kernel code, but with user space detached from the kill, the status/statistics are lost.

I don't know about btrfs, but with LVM, nothing is lost. You can run "pvmove" at any time to continue to show the status.

And we're talking about KillUserProcesses; logging out _already_ killed the pvmove command. Nothing has changed.

-- Chris Adams linux@cmadams.net

Chris Murphy

3:10 p.m.

On Sat, Jul 30, 2016 at 10:44 AM, Chris Adams linux@cmadams.net wrote:

...

Once upon a time, Chris Murphy lists@colorremedies.com said:

...
On Fri, Jul 29, 2016 at 2:36 PM, Chris Adams linux@cmadams.net wrote:

...
The actual work of pvmove is not done by the command you run; that sets it up and it is run in the background (by a kernel thread). All the command you run does then is periodically check and print a percentage done.

It's the same with btrfs balance and scrub. It may be the operation completes by kernel code, but with user space detached from the kill, the status/statistics are lost.

I don't know about btrfs, but with LVM, nothing is lost. You can run "pvmove" at any time to continue to show the status.

I see the same behavior. I'm not sure what the user space tool reattaches to, something in sysfs or lvmetad?

...

And we're talking about KillUserProcesses; logging out _already_ killed the pvmove command. Nothing has changed.

It doesn't go to the background by default, where btrfs scrub, balance, and replace do. At least with scrub, the process continues to work when KillUserProcesses=no and I logout; where if it's set to yes, the process goes from status S to Z, and shortly after that the kernel threads stop working also. I can't actually tell if this is just an accounting problem, or if the scrub really is interrupted.

Guess we'll see what btrfs upstream thinks about it, but the idea that Btrfs users are going to know their scrubs fail due to this feature is flawed. There's zero indication why the process dies, and there's no way to get more information in the journal to hint at why it dies. Plus it's inconsistent. Neither btrfs balance nor replace have this problem, those processes continue to run with status D and R until they complete.

systemd KillUserProcesses=yes and btrfs scrub https://bugzilla.kernel.org/show_bug.cgi?id=150781

Anyway, the feature still strikes me as merely exchanging problems we know for problems we don't know. It's rather a lot like punting.

-- Chris Murphy

Tomasz Torcz

4:11 p.m.

On Sat, Jul 30, 2016 at 02:10:32PM -0600, Chris Murphy wrote:

...

On Sat, Jul 30, 2016 at 10:44 AM, Chris Adams linux@cmadams.net wrote:

...
Once upon a time, Chris Murphy lists@colorremedies.com said:

...
On Fri, Jul 29, 2016 at 2:36 PM, Chris Adams linux@cmadams.net wrote:

...
The actual work of pvmove is not done by the command you run; that sets it up and it is run in the background (by a kernel thread). All the command you run does then is periodically check and print a percentage done.

It's the same with btrfs balance and scrub. It may be the operation completes by kernel code, but with user space detached from the kill, the status/statistics are lost.

I don't know about btrfs, but with LVM, nothing is lost. You can run "pvmove" at any time to continue to show the status.

I see the same behavior. I'm not sure what the user space tool reattaches to, something in sysfs or lvmetad?

...
And we're talking about KillUserProcesses; logging out _already_ killed the pvmove command. Nothing has changed.

Guess we'll see what btrfs upstream thinks about it, but the idea that Btrfs users are going to know their scrubs fail due to this feature is flawed. There's zero indication why the process dies, and there's no way to get more information in the journal to hint at why it dies. Plus it's inconsistent. Neither btrfs balance nor replace have this problem, those processes continue to run with status D and R until they complete.

If KillUserProcesses is on, systemd logs when cleanup happens (in v231+). It is up to the admin to connect the dots. Personally I'm using the following unit:

---
[Unit]
Description=btrfs scrub of %I

[Service]
ExecStart=/usr/sbin/btrfs scrub start -B -d %I
ExecReload=/usr/sbin/btrfs scrub status %I
ExecStop=-/usr/sbin/btrfs scrub cancel %I
IOSchedulingClass=idle
BlockIOWeight=128
PrivateDevices=yes
PrivateNetwork=yes
PrivateTmp=yes
---

Integrates nicely with timers.
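As an illustration of the timer integration, here is a sketch that installs a matching timer. It assumes the unit above is saved as a template named btrfs-scrub@.service; that name, the timer name, the function name, and the monthly schedule are all hypothetical and not from the original message.

```shell
#!/bin/sh
# Sketch: install a timer for a template unit assumed to be named
# btrfs-scrub@.service (hypothetical; the message does not name the unit).

write_scrub_timer() {
    # $1: unit directory; defaults to the admin unit path.
    dir=${1:-/etc/systemd/system}
    mkdir -p "$dir" || return 1
    cat > "$dir/btrfs-scrub@.timer" <<'EOF'
[Unit]
Description=Monthly btrfs scrub of %I

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# As root on a real system, for the filesystem mounted at /
# ("-" is the systemd-escaped form of the path "/"); not executed here:
#   write_scrub_timer
#   systemctl daemon-reload
#   systemctl enable --now 'btrfs-scrub@-.timer'
```

A timer instance activates the service instance of the same name by default, so the scrub runs as a system service and is not part of any login session.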

...

systemd KillUserProcesses=yes and btrfs scrub https://bugzilla.kernel.org/show_bug.cgi?id=150781

Why do you require systemd v230? KillUserProcesses has existed for 5 years already; it should work the same with all systemd versions. Has anything changed in 230?

-- Tomasz Torcz, xmpp: zdzichubg@chrome.pl
"Morality must always be based on practicality." -- Baron Vladimir Harkonnen

Chris Murphy

1 Aug

1:30 p.m.

On Sat, Jul 30, 2016 at 3:11 PM, Tomasz Torcz tomek@pipebreaker.pl wrote:

...

If KillUserProcesses is on, systemd logs when cleanup happens (in v231+). It is up to the admin to connect the dots.

Yep, it is different with systemd v231.

systemd[1]: user@1000.service: Killing process 4866 (btrfs) with signal SIGKILL.

What's interesting is that 'btrfs scrub' process goes from status S to Z. However, 'btrfs balance', even though it receives the same SIGKILL from systemd, remains running with status R and D. And so does 'btrfs replace'. The session is definitely gone per loginctl, but the process started in that user session is still running for the entire balance and replace. This took several minutes, so I don't think it's just some kind of delayed death. Maybe only processes with status S are subject to SIGKILL, and R and D can ignore it? *shrug*

Both top and ps report that the user for these processes is root, which makes sense because I used sudo to run them. However, the default KillExcludeUsers=root appears to not apply to user ownership of the process but rather user session. So even if I were to sudo -i, or use su -c, and then exit the DE, apparently those processes are subject to SIGKILL. That's questionable.
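To audit which processes were killed, the v231 log line quoted above can be picked apart with sed. This is a sketch: the function name is made up, and the message format is inferred from that single example line, so other systemd versions may word it differently.

```shell
#!/bin/sh
# Sketch: extract PID, command name, signal and unit from systemd's
# "Killing process" messages. The format is taken from the one example
# line in this thread and is an assumption for anything else.

parse_kill_lines() {
    sed -n 's/^.*systemd\[[0-9]*]: \([^ ]*\): Killing process \([0-9]*\) (\(.*\)) with signal \(SIG[A-Z0-9]*\)\.$/\2 \3 \4 \1/p'
}

# Example with the line from the thread:
printf '%s\n' 'systemd[1]: user@1000.service: Killing process 4866 (btrfs) with signal SIGKILL.' | parse_kill_lines
# prints: 4866 btrfs SIGKILL user@1000.service

# On a live system, assuming the previous boot's journal is retained:
#   journalctl -b -1 --no-pager | parse_kill_lines
```

Lines that do not match the pattern are dropped, so the output lists only the kills.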

...

...
systemd KillUserProcesses=yes and btrfs scrub https://bugzilla.kernel.org/show_bug.cgi?id=150781

Why do you require systemd v230? KillUserProcesses has existed for 5 years already; it should work the same with all systemd versions. Has anything changed in 230?

That's misleading. The logic was that it's the default starting with systemd v230. I updated the bug.

-- Chris Murphy

John Florian

29 Jul

1:05 p.m.

On Fri, 2016-07-29 at 11:31 -0600, Chris Murphy wrote:

...

Can anyone explain why the feature works for Logout, but doesn't work for Restart or Shutdown when initiated in the logged in shell session?

Aha! I manually enabled this for F23 hoping to eliminate the 90s wait I seem to get on every exit. It seemed to rarely have any benefit though and now I see why: I typically only use the reboot feature!

-- John Florian john.florian@dart.biz

Björn Persson

13 Jul

1:28 p.m.

In my opinion the proposal needs to be amended in the following ways:

Scope:

Understanding the scope of this Change requires understanding how many programs there are that will have to be adapted to avoid getting killed. Therefore the Scope section should contain a complete list of affected packages. It would also be good to list known affected programs that aren't packaged in Fedora, as users may be using them.

Currently not even all of the programs that were mentioned in the first email thread are listed. I suspect that there are more, maybe many more.

How To Test:

This section says only: "User processes should be terminated when a user session ends. Services which take the steps to stay around should stay around."

That's how things have always been *supposed* to work. To verify that KillUserProcesses actually works, a tester needs a program that is supposed to terminate with the user session, but doesn't, so that they can check that SystemD kills the program successfully.

Contingency Plan:

It should be stated under what circumstances the contingency plan will be activated. If KillUserProcesses itself works as intended, but none of the affected programs have been adapted to not break, will Fedora 25 then be released with these programs broken, or will KillUserProcesses be changed back to "no"? What if only half of the affected programs have been adapted? Or all but one of them?

Release Notes:

Sysadmins need to be made aware that any in-house-written or otherwise locally installed persistent programs they might have will stop working, so a release note is quite important.

Björn Persson

Andrew Lutomirski

2:04 p.m.

On Wed, Jul 13, 2016 at 11:28 AM, Björn Persson <Bjorn@rombobjörn.se> wrote:

...

In my opinion the proposal needs to be amended in the following ways:

Scope:

Understanding the scope of this Change requires understanding how many programs there are that will have to be adapted to avoid getting killed. Therefore the Scope section should contain a complete list of affected packages. It would also be good to list known affected programs that aren't packaged in Fedora, as users may be using them.

Currently not even all of the programs that were mentioned in the first email thread are listed. I suspect that there are more, maybe many more.

I want to add a couple more to the scope: dnf and PackageKit. I don't care *how* dnf and PackageKit get started. If they're making changes, systemd should *not* zap them on logout.

Colin Walters

14 Jul

8:14 a.m.

On Wed, Jul 13, 2016, at 03:04 PM, Andrew Lutomirski wrote:

...

I want to add a couple more to the scope: dnf and PackageKit. I don't care *how* dnf and PackageKit get started. If they're making changes, systemd should *not* zap them on logout.

PackageKit has been a daemon from the start (and this has been a long-running major technological difference from yum). For precisely this reason, you wouldn't want logging out of your desktop to break updates in the middle.

rpm-ostree is also a daemon, and what's notable here is that, unlike the dnf/PackageKit situation, you can't even use rpm -Uvh, so there is only one way to mutate the host system, and it consistently goes through a daemon.

Stephen Gallagher

7:35 a.m.

On 07/07/2016 08:13 AM, Jan Kurik wrote:

...

= Proposed System Wide Change: KillUserProcesses=yes by default = https://fedoraproject.org/wiki/Changes/KillUserProcesses_by_default

Change owner(s):

Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl

<snip>

Copying my response from https://fedorahosted.org/fesco/ticket/1600#comment:4

As the Change Proposal stands today, I wouldn't be willing to approve it. So let's figure out what it would take to get there:

* I want to see a crowdsourced list of all the packages known to be negatively affected by this change. (This should be possible by investigating the various email threads on the issue).

* The list of affected packages should be divided into two categories by FESCo:
  * Tier 1 packages must be ported to support operation under KillUserProcesses=yes by the Contingency Deadline. If any of the Tier 1 packages are still not migrated by that point, we return to the KillUserProcesses=no default until Fedora 26.
  * Tier 2 packages are non-blocking for this effort. They will be tracked as bugs (probably also listed on the F25 Common Bugs page), but a failure to meet this deadline is not sufficient to prevent us from switching.

* I'd like to see the Contingency Deadline set at Beta Freeze.

* I would like to see a guideline written up on the wiki and added to the Change Proposal on some of the ways that the porting could be done (such as converting tools to services, using the allow-linger functionality, etc.)

* I'd like the package maintainers of any package FESCo declares Tier 1 to be added to the Change Proposal as an owner.
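
As one possible starting point for such a porting guideline, the sketch below wraps the two systemd-run invocations quoted in the Change text ('systemd-run --user /usr/bin/foo' and 'systemd-run --user --scope /usr/bin/foo'). The helper names are hypothetical. Whether the resulting unit outlives the user's last logout may additionally depend on lingering being enabled for that user (loginctl enable-linger), which this sketch does not handle.

```python
import subprocess


def systemd_run_argv(command, scope=True):
    """Build the argv for starting `command` outside the session scope.

    scope=True  -> transient scope unit (systemd-run --user --scope ...)
    scope=False -> transient service unit (systemd-run --user ...)
    """
    argv = ["systemd-run", "--user"]
    if scope:
        argv.append("--scope")
    return argv + list(command)


def run_outside_session(command, scope=True):
    """Run `command` via systemd-run; raises CalledProcessError on failure."""
    return subprocess.run(systemd_run_argv(command, scope), check=True)
```

For example, systemd_run_argv(["/usr/bin/foo"]) yields the scope form, and passing scope=False yields the service form.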

devel@lists.fedoraproject.org

43 comments

20 participants

participants (20)

Adam Williamson
Alex Thomas
Andrew Lutomirski
Ben Rosser
Björn Persson
Chris Adams
Chris Murphy
Colin Walters
Dan Book
Garrett Holmstrom
Garry Williams
Jan Kurik
John Florian
Lennart Poettering
Nico Kadel-Garcia
Przemek Klosowski
Stephen Gallagher
Stephen John Smoogen
Tomasz Torcz
Zbigniew Jędrzejewski-Szmek