(cross-posting to devel and desktop lists, ideally reply to both)
Hello,
this is a request for feedback regarding adjusting process limits to make Steam/Wine work better on Fedora.
*A quick background:* In August, Valve announced [1] Proton [2], their own fork of Wine, now included in their Steam Play functionality. It allows Linux players to run Windows-only games transparently from Steam without any complex configuration, just at the press of a button. Not everything works, of course, but the success rate is more than decent [3]. As far as I can tell, the reception among Linux gamers has been *very* enthusiastic.
*The technical details:* In order to boost performance, Valve included DXVK [4], which translates DirectX calls to Vulkan (instead of to OpenGL, as vanilla Wine does), and esync [5], an existing Wine patchset that performs process synchronization using file descriptors, in a more efficient manner than vanilla Wine. You can read the linked readme for the full explanation.
The esync patchset is the main subject of this email. It uses a lot of file descriptors, and the default kernel limits are not sufficient for many games. Valve notes this in their requirements document (under "FD LIMIT REQUIREMENTS") [6], and it is further described in the esync readme [5]. The documents also say that Debian and its derivatives like Ubuntu have already raised the limit on open file descriptors, so those distributions work out of the box with esync. Fedora is not one of them. I wonder if we can consider changing that.
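As a rough illustration of why these limits matter, you can watch a process's fd consumption via /proc. This is just a sketch: it inspects the current shell ($$), but any PID (e.g. a running Wine game) works the same way.

```shell
# Count how many fds a process currently holds, and show its
# "max open files" soft/hard limits from /proc.
# Using the current shell ($$) as an example PID.
pid=$$
echo "open fds: $(ls "/proc/$pid/fd" | wc -l)"
grep "Max open files" "/proc/$pid/limits"
```

Pointing `pid` at an esync-enabled game instead shows how quickly the open-fd count approaches the default 4096 hard limit.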
*Debian and Ubuntu:* I've installed both Debian (Sid) and Ubuntu (18.10) to verify this, and can confirm it. The default soft limit stays the same (1024), but the hard limit is increased from 4096 to 1048576 (2^20). However, this only applies to the systemd system instance (systemd --system, PID 1), not to systemd user instances (systemd --user). Most of the apps you start in your session are children of gnome-shell, which doesn't run under the systemd user instance, so the higher limits apply to them as well (including Steam). However, some apps (probably started via D-Bus, I'm not really sure, but importantly this includes gnome-terminal) are started as children of the systemd user instance and therefore have the original low limits applied. I don't know whether this is intentional or just an omission. I tried really hard to find the place where Debian/Ubuntu patches the upstream limits, so that I could read some justification/explanation of the change, but I wasn't able to find it (I searched the available patches for the kernel, systemd and PAM).
*Configuring the limits:* You can display the soft and hard limits of your current terminal using ulimit:

$ ulimit -Sn
1024
$ ulimit -Hn
4096

However, note that gnome-terminal runs under the systemd user instance, so at least on Debian/Ubuntu it will not see the higher limits (and neither will Steam when started from the terminal).
You can also use prlimit to see the limits of any running process, e.g. the systemd system instance, a systemd user instance, or running Steam:

$ sudo prlimit --nofile --pid 1
RESOURCE DESCRIPTION               SOFT     HARD     UNITS
NOFILE   max number of open files  1048576  1048576  files
You can modify the limits on the fly like this:

$ sudo prlimit --nofile=1024:1048576 --pid PID
You can increase the default limits by editing /etc/systemd/system.conf (and /etc/systemd/user.conf, if you want to adjust the systemd user instance as well) and setting:

DefaultLimitNOFILE=1024:1048576
Alternatively, you can drop a file containing:

[Manager]
DefaultLimitNOFILE=1024:1048576

into /etc/systemd/system.conf.d/ (and /etc/systemd/user.conf.d/).
*Default limits in Fedora:* From a technical point of view I'm not able to judge whether raising the file descriptor limits by default is a trivial change or something with important security implications. That's why I'm writing this email, to hopefully get replies from more knowledgeable people. The fact that Debian raised the limits gives me hope we could do the same in Fedora (perhaps just in Workstation, if the change is not welcome in the whole distribution). If somebody can find the justification of the Debian devs, that would be great. I'd very much like to see Fedora (Workstation) be a good choice for Linux gamers (we already packaged gamemode), and this might be an important step to make sure Steam games don't work worse than on Ubuntu (but hopefully even better, due to our more recent drivers).
Thanks for your feedback.
[1] https://steamcommunity.com/games/221410/announcements/detail/169605585573935...
[2] https://github.com/ValveSoftware/Proton
[3] https://spcr.netlify.com
[4] https://github.com/doitsujin/dxvk
[5] https://github.com/zfigura/wine/blob/esync/README.esync
[6] https://github.com/ValveSoftware/Proton/blob/proton_3.7/PREREQS.md#fd-limit-...
On Fri, Oct 5, 2018 at 11:31 AM, Kamil Paral kparal@redhat.com wrote:
Debian and Ubuntu: I've installed both Debian (Sid) and Ubuntu (18.10) to verify this, and can confirm it. The default soft limit stays the same (1024), but the hard limit is increased from 4096 to 1048576 (2^20). However, this only applies to the systemd's system instance (systemd --system, PID 1), and not to systemd user instances (systemd --user).
It seems uncontroversial to at least raise it to 65535, about one order of magnitude, rather than three. And to apply it to both system and user instances.
On Fr, 05.10.18 19:31, Kamil Paral (kparal@redhat.com) wrote:
Coincidentally, at All Systems Go! in Berlin last week I had some discussions with kernel people about RLIMIT_NOFILE defaults. They basically suggested that the memory and performance cost of large numbers of fds on current kernels is cheap, and that we should bump the hard limit in systemd for all userspace processes.
I have thus prepared this a few days ago:
https://github.com/systemd/systemd/pull/10244
This should have the effect, on systemd systems that do not otherwise patch around RLIMIT_NOFILE, that the new default hard limit for all userspace is 256K (though the soft limit remains at 1K, for compat with select()). AFAIK Fedora doesn't override RLIMIT_NOFILE artificially, hence these new systemd upstream defaults should trickle down to Fedora too.
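To see what a given installation actually ends up with, a few quick checks (a sketch; the systemctl line only applies where systemd is running):

```shell
# Kernel per-process ceiling for fds (1048576 by default on most kernels):
cat /proc/sys/fs/nr_open
# Hard limit of the current shell:
ulimit -Hn
# systemd's default for newly started services, where available:
command -v systemctl >/dev/null && systemctl show --property=DefaultLimitNOFILE || true
```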
This is not quite the 1M you appear to ask for though… I picked 256K mostly because I wanted to stay lower than the kernel built-in max (which is 1M, i.e. /proc/sys/fs/nr_open), and needed to pick something. Do you have any particular reason to prefer 1M over 256K? I am completely open to suggestions there...
Lennart
First off, thanks, Kamil, for starting this discussion. I've been meaning to bring it up.
On 10/5/18 1:03 PM, Lennart Poettering wrote: [snip]
This is not quite the 1M you appear to ask for though… I picked 256K mostly because I wanted to stay lower than the kernel built-in max (which is 1M, i.e. /proc/sys/fs/nr_open), and needed to pick something. Do you have any particular reason to prefer 1M over 256K? I am completely open to suggestions there...
The upstream esync branch requests setting the hard limit to 1M.
https://github.com/zfigura/wine/blob/esync/README.esync
I haven't torn the project apart to see whether 1M is really necessary, so a different limit may be up for discussion.
Regards, Michael
* Lennart Poettering:
On Fr, 05.10.18 19:31, Kamil Paral (kparal@redhat.com) wrote:
Coincidentally, at All Systems Go! in Berlin last week I had some discussions with kernel people about RLIMIT_NOFILE defaults. They basically suggested that the memory and performance cost of large numbers of fds on current kernels is cheap, and that we should bump the hard limit in systemd for all userspace processes.
Which kernel version is that? Is that a new patch? Or some older kernel?
It's definitely not true for kernel 4.18, see the script I posted.
Thanks, Florian
On Fr, 19.10.18 09:12, Florian Weimer (fweimer@redhat.com) wrote:
Coincidentally, at All Systems Go! in Berlin last week I had some discussions with kernel people about RLIMIT_NOFILE defaults. They basically suggested that the memory and performance cost of large numbers of fds on current kernels is cheap, and that we should bump the hard limit in systemd for all userspace processes.
Which kernel version is that? Is that a new patch? Or some older kernel?
It's definitely not true for kernel 4.18, see the script I posted.
I asked Tejun Heo about all this; here is what he replied:
<snip> In cgroup1, socket buffers are handled by a separate memory sub-controller. It's cumbersome to use, somewhat broken and doesn't allow for comprehensive memory control. cgroup2, however, by default accounts socket buffer as part of a given cgroup's memory consumption correctly interacting with socket window management.
The OOM killer too fails to take socket buffers into account, and a high number of sockets can lead it to make ineffective decisions; however, this failure mode isn't confined to a high number of sockets at all - a smaller number of fat pipes, tmpfs, mount points or any other kernel objects which can be pinned by processes can trigger this.
cgroup2 can track or control most of these usages and at least for us switching to oomd for workload health management solves most of the problems that we've encountered. In the longer term, the kernel OOM killer can be improved to make better decisions too. </snip>
("us" in the above is Facebook, btw.)
So, yeah, if we'd use cgroupv2 on Fedora, then everything would be great (unfortunately the container messiness blocks that for now). But as long as we don't, lifting the fd limit is not really making things worse, given that there are tons of other easily exploitable ways to acquire untracked memory...
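For reference, one quick way to tell whether a system is already on the unified cgroup v2 hierarchy (a sketch; `stat -fc %T` prints the filesystem type of the mount point):

```shell
# Prints "cgroup2fs" when /sys/fs/cgroup is the unified cgroup v2
# hierarchy, and typically "tmpfs" on a v1/hybrid setup.
stat -fc %T /sys/fs/cgroup
```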
Lennart
* Lennart Poettering:
On Fr, 19.10.18 09:12, Florian Weimer (fweimer@redhat.com) wrote:
Coincidentally, at All Systems Go! in Berlin last week I had some discussions with kernel people about RLIMIT_NOFILE defaults. They basically suggested that the memory and performance cost of large numbers of fds on current kernels is cheap, and that we should bump the hard limit in systemd for all userspace processes.
Which kernel version is that? Is that a new patch? Or some older kernel?
It's definitely not true for kernel 4.18, see the script I posted.
I inquired Tejun Heo about this all, this is what he replied:
So, yeah, if we'd use cgroupv2 on Fedora, then everything would be great (unfortunately the container messiness blocks that for now). But as long as we don't, lifting the fd limit is not really making things worse, given that there are tons of other easily exploitable ways to acquire untracked memory...
How does cgroupv2 solve this if we do not configure hard limits for the user session? I don't want us to go back to static resource allocation for applications, similar to what System 9 did.
Anyway, the problem suggests to me that the default soft limit should not be raised until the kernel gets better recovery, so that applications won't trigger the issue by accident.
Thanks, Florian
On Mo, 22.10.18 11:58, Florian Weimer (fweimer@redhat.com) wrote:
Anyway, the problem suggests to me that the default soft limit should not be raised until the kernel gets better recovery, so that applications won't trigger the issue by accident.
During the whole discussion we always made clear that we can't and won't change the default soft limit, because of the incompatibility with select(), which cannot deal with fds > 1024. I.e. there always needs to be an explicit "opt-in" step for apps to say "I am happy with fds > 1024" (aka "I promise not to use select()") by bumping the soft limit up to the hard limit.
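In shell terms, that opt-in is simply a process raising its own soft limit up to the hard limit before launching the fd-hungry program (a sketch; "steam" is illustrative):

```shell
# The "I promise not to use select()" step: raise the soft fd limit
# to whatever the hard limit allows, then hand off to the program.
ulimit -Sn "$(ulimit -Hn)"
echo "soft fd limit is now $(ulimit -Sn)"
# exec steam   # illustrative; launch the real program here
```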
Lennart
* Kamil Paral:
From a technical point of view I'm not able to judge whether raising the fileno limits by default is a trivial change or something with important security implications.
It has implications for reliability (and perhaps security). File descriptors can refer to sockets, and each socket can have a fairly large amount of unswappable kernel memory associated with it. This memory is not tracked along with the process that created the sockets or has them opened, so the OOM killer does not take it into account when selecting processes to terminate.
The attached script, when run with “python3 many-sockets.py 50000” as a regular user after raising the limit, tricks the OOM killer into terminating processes. Important processes such as systemd-journald fail because the OOM killer cannot recover any memory. It even terminates processes that are already fully swapped out.
I think a reasonable file descriptor limit is an important safety net.
Thanks, Florian