How did /dev/shm get noexec in Fedora 15 rawhide?
  $ grep /dev/shm /proc/mounts
  tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
  $ grep -srl noexec /etc
  /etc/alternatives/ld
  /etc/fstab   ## derived from /proc/mounts
  /etc/mtab    ## derived from /proc/mounts
This is a change from Fedora 14, and I cannot find documentation. The only 'noexec' that I can find in the source to systemd-15 is two mentions in units/var-{lock,run}.mount.
As a site administrator, how can I change the default to omit 'noexec'? As a project leader, how can I get my project's programs working again if I do not have the privileges of a site administrator?
The project is a database system that creates and dlopen()s plugins on-the-fly, for better performance on ["long-running"] queries. We like the speed of creat+write+close+open+read+mmap on /dev/shm. If /dev/shm and /tmp both become off limits, then what is the recommended replacement location?
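The grep check in this message can also be scripted. The following is a minimal sketch, assuming the usual whitespace-separated /proc/mounts layout (device, mount point, type, options, two numeric fields); the function name mount_options and the SAMPLE constant are illustrative and not part of any existing tool.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not listed."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 1 is the mount point, field 3 the comma-separated options.
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later entry shadows an earlier one
    return found


# The line reported in the original message, used here as sample input.
SAMPLE = "tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0\n"

options = mount_options(SAMPLE, "/dev/shm")
print("noexec" in options)  # prints True
```

On a live system the same function could be given the contents of /proc/mounts instead of SAMPLE.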
--
On Sun, Dec 12, 2010 at 07:49:27PM -0800, John Reiser wrote:
How did /dev/shm get noexec in Fedora 15 rawhide?
  $ grep /dev/shm /proc/mounts
  tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
  $ grep -srl noexec /etc
  /etc/alternatives/ld
  /etc/fstab   ## derived from /proc/mounts
  /etc/mtab    ## derived from /proc/mounts
This is a change from Fedora 14, and I cannot find documentation. The only 'noexec' that I can find in the source to systemd-15 is two mentions in units/var-{lock,run}.mount.
the MS_NOEXEC flag is in systemd's private fstab, see systemd/src/mount-setup.c:
static const MountPoint mount_table[] = {
        { "proc",     "/proc",                  "proc",     NULL,                MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "sysfs",    "/sys",                   "sysfs",    NULL,                MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "devtmpfs", "/dev",                   "devtmpfs", "mode=755",          MS_NOSUID,                    true  },
        { "tmpfs",    "/dev/shm",               "tmpfs",    "mode=1777",         MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "devpts",   "/dev/pts",               "devpts",   NULL,                MS_NOSUID|MS_NOEXEC,          false },
        { "tmpfs",    "/sys/fs/cgroup",         "tmpfs",    "mode=755",          MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "cgroup",   "/sys/fs/cgroup/systemd", "cgroup",   "none,name=systemd", MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
};
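To connect the flags in this table with the option names that show up in /proc/mounts, here is a small sketch. It assumes the Linux values of the MS_* constants from <sys/mount.h> (MS_NOSUID = 2, MS_NODEV = 4, MS_NOEXEC = 8). The MOUNT_TABLE dictionary is a hand transcription of part of the quoted table, and option_names is a hypothetical helper, not a systemd function.

```python
# Flag values as defined in <sys/mount.h> on Linux.
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8

# Listed in the order the kernel prints them in /proc/mounts.
FLAG_NAMES = [(MS_NOSUID, "nosuid"), (MS_NODEV, "nodev"), (MS_NOEXEC, "noexec")]

# Mount point -> flags, transcribed from the quoted mount_table (illustrative).
MOUNT_TABLE = {
    "/proc": MS_NOSUID | MS_NOEXEC | MS_NODEV,
    "/sys": MS_NOSUID | MS_NOEXEC | MS_NODEV,
    "/dev": MS_NOSUID,
    "/dev/shm": MS_NOSUID | MS_NOEXEC | MS_NODEV,
    "/dev/pts": MS_NOSUID | MS_NOEXEC,
}


def option_names(flags):
    """Translate a flag bitmask into the matching mount option names."""
    return [name for bit, name in FLAG_NAMES if flags & bit]


print(option_names(MOUNT_TABLE["/dev/shm"]))  # prints ['nosuid', 'nodev', 'noexec']
```

This matches the nosuid,nodev,noexec options reported for /dev/shm in the original question.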
As a site administrator, how can I change the default to omit 'noexec'?
mount -o remount,exec ?
Karel
On 12/13/2010 7:37, Karel Zak wrote:
the MS_NOEXEC flag is in systemd's private fstab, see systemd/src/mount-setup.c:
static const MountPoint mount_table[] = {
        { "proc",     "/proc",                  "proc",     NULL,                MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "sysfs",    "/sys",                   "sysfs",    NULL,                MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "devtmpfs", "/dev",                   "devtmpfs", "mode=755",          MS_NOSUID,                    true  },
        { "tmpfs",    "/dev/shm",               "tmpfs",    "mode=1777",         MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "devpts",   "/dev/pts",               "devpts",   NULL,                MS_NOSUID|MS_NOEXEC,          false },
        { "tmpfs",    "/sys/fs/cgroup",         "tmpfs",    "mode=755",          MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
        { "cgroup",   "/sys/fs/cgroup/systemd", "cgroup",   "none,name=systemd", MS_NOSUID|MS_NOEXEC|MS_NODEV, true  },
};
As a site administrator, how can I change the default to omit 'noexec'?
mount -o remount,exec ?
If systemd is going to ignore fstab entries, could we please have the fstab file on newly-installed systems replace the entries that would be ignored with commentary that explains which filesystems will be ignored?
That said, this should really be configurable without recompiling the init system.
On Mon, Dec 13, 2010 at 09:47:49AM -0600, Garrett Holmstrom wrote:
If systemd is going to ignore fstab entries, could we please have the fstab file on newly-installed systems replace the entries that would be ignored with commentary that explains which filesystems will be ignored?
That said, this should really be configurable without recompiling the init system.
Amen to that.
It's crazy to have these things hard-coded into a C program.
Rich.
Karel Zak kzak@redhat.com wrote:
As a site administrator, how can I change the default to omit 'noexec'?
mount -o remount,exec ?
That's not really changing the default.
David
Hi,
On Monday, 13 December 2010 at 14:37, Karel Zak wrote:
the MS_NOEXEC flag is in systemd's private fstab, see systemd/src/mount-setup.c:
You're not kidding. Could the author of this code (I'm guessing... Lennart?) please explain this extremely bright idea of hard-coding what should be admin-configurable?
Regards, Dominik
On Mon, Dec 13, 2010 at 11:57:51PM +0100, Dominik 'Rathann' Mierzejewski wrote:
the MS_NOEXEC flag is in systemd's private fstab, see systemd/src/mount-setup.c:
You're not kidding. Could the author of this code (I'm guessing... Lennart?) please explain this extremely bright idea of hard-coding what should be admin-configurable?
That's not a very constructive wording. Filing a bug showing your use-case would be helpful.
Matthew Miller wrote on Tue, 14 Dec 2010 at 07:39 -0500:
That's not a very constructive wording. Filing a bug showing your use-case would be helpful.
Changing the semantics of /etc/fstab without any consultation with fedora-devel or even notification of Fedora that something so long-standing is changing is hardly constructive either.
I can happily live with "systemd is a new, better init system" without knowing the details. I consider "systemd replaces 15% of /etc and changes semantics of another 5%" without discussing the details in advance unacceptable for the distribution as a whole, although this decision is of course FESCo's. Mirek
On Tue, Dec 14, 2010 at 01:53:37PM +0100, Miloslav Trmač wrote:
Changing the semantics of /etc/fstab without any consultation with fedora-devel or even notification of Fedora that something so long-standing is changing is hardly constructive either.
I can happily live with "systemd is a new, better init system" without knowing the details. I consider "systemd replaces 15% of /etc and changes semantics of another 5%" without discussing the details in advance unacceptable for the distribution as a whole, although this decision is of course FESCo's. Mirek
Let's keep discussion calm and technical. “Systemd contains native implementations of various tasks that need to be executed as part of the boot process. For example, it sets the host name or configures the loopback network device. It also sets up and mounts various API file systems, such as /sys or /proc.”
We saw that it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere other than /sys? Or /dev with a mode other than 755? All of those directories are mounted _identically_ on every Linux distribution out there. Why pollute fstab with repeated lines on a million machines?
I can see that it may look like taking power from the admin, but has anyone ever changed how devpts is mounted? Really? Being able to change something just for the sake of the ability is not always sane. There are things which we can change, and some things which shouldn't be touched by the admin. And I'm not proposing dumbing down the admin. Back when I ran Slackware I rewrote part of the initscripts to suit me. But really, the admin should worry about important things and leave the boring (and identical across distros) parts to someone else.
The original problem could be solved by configuring a scratch tmpfs in /mnt/scratch or somewhere else.
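As an illustration of the scratch tmpfs suggestion, the sketch below builds the corresponding /etc/fstab line. The size=512m value and the option list are assumptions chosen for the example, not something proposed in the thread, and fstab_line is a hypothetical helper.

```python
def fstab_line(source, mount_point, fstype, options, dump=0, passno=0):
    """Format one /etc/fstab entry from its six fields."""
    return f"{source} {mount_point} {fstype} {','.join(options)} {dump} {passno}"


# A scratch tmpfs that leaves out noexec, so mapped plugins can be executed.
# size= and mode= are standard tmpfs mount options; the values are assumed.
line = fstab_line(
    "tmpfs", "/mnt/scratch", "tmpfs",
    ["rw", "nosuid", "nodev", "size=512m", "mode=1777"],
)
print(line)  # prints tmpfs /mnt/scratch tmpfs rw,nosuid,nodev,size=512m,mode=1777 0 0
```

Creating the mount point and adding the line both need site-administrator privileges, so this does not answer the unprivileged half of the original question.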
On 12/14/2010 02:24 PM, Tomasz Torcz wrote:
Let's keep discussion calm and technical. “Systemd contains native implementations of various tasks that need to be executed as part of the boot process. For example, it sets the host name or configures the loopback network device. It also sets up and mounts various API file systems, such as /sys or /proc.”
We saw that it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere other than /sys? Or /dev with a mode other than 755? All of those directories are mounted _identically_ on every Linux distribution out there. Why pollute fstab with repeated lines on a million machines?
I can see that it may look like taking power from the admin, but has anyone ever changed how devpts is mounted? Really? Being able to change something just for the sake of the ability is not always sane. There are things which we can change, and some things which shouldn't be touched by the admin. And I'm not proposing dumbing down the admin. Back when I ran Slackware I rewrote part of the initscripts to suit me. But really, the admin should worry about important things and leave the boring (and identical across distros) parts to someone else.
The original problem could be solved by configuring a scratch tmpfs in /mnt/scratch or somewhere else.
The problem is not the technical solution. The problem is that changes to something as important as /etc/fstab are decided without Fedora developers. Usually such a change would be discussed on the list beforehand and would be a feature for the new Fedora. It's not even mentioned on the Systemd Feature page.
Marcela Mašláňová wrote on Tue, 14 Dec 2010 at 14:55 +0100:
The problem is not the technical solution. The problem is that changes to something as important as /etc/fstab are decided without Fedora developers. Usually such a change would be discussed on the list beforehand and would be a feature for the new Fedora. It's not even mentioned on the Systemd Feature page.
+1
This is (was?) UNIX. No single person knows about all the creative and important ways that users have configured the system to suit their needs. Dropping system-wide features should be a conscious decision, not something we accidentally discover several months later when user complaints start to come in. Mirek
Marcela Mašláňová (mmaslano@redhat.com) said:
That's not a very constructive wording. Filing a bug showing your use-case would be helpful.
I'd like to restate this point. It's rather disappointing that so many people have decided to skip over this, and prefer to instead complain, insinuate, and argue on list rather than starting with this simple, more likely to be productive, action.
The problem is not the technical solution. The problem is that changes to something as important as /etc/fstab are decided without Fedora developers.
Eh, what? It's a change to how API filesystems (/proc, /sys, etc.) get mounted. When this was done in rc.sysinit, every change to how it mounted /proc wasn't discussed on the devel list. When we switched to having dracut be the primary way that API filesystems are mounted, that wasn't put up to a FESCo vote.
And it's also not fair to say that 'Fedora developers' aren't involved; heck, there's at least 10 of them on the systemd mailing list, by a quick count. If you mean, "it wasn't posted to devel@, or it wasn't brought to FESCo", well, we don't review every change to upstream packages in this way... if we did, we'd be drowned in minutiae. I mean, I could have brought the addition on how to add multiple IPv4 addresses to interfaces to FESCo for discussion and vote, but I've got better things to do with my time.
In any case, I'm pretty sure it's not even intentional. systemd has two areas of mounting:
- systemd mounts API filesystems without them needing to be in /etc/fstab. This is for a variety of reasons - having every system installer have to write /proc, /sys, and so on is pretty wasteful. It also can give inexperienced admins the idea that it's configuration that can be changed - they then rename the mount point from /proc to /processes and *kaboom*.
- systemd mounts system filesystems from /etc/fstab. This includes mount options, etc., and (I'd think) would be fairly uncontroversial.
The first of these happens before the second (as you obviously need /proc, /sys, etc. very early), however systemd already has /lib/systemd/system/systemd-remount-api-vfs.service:
...
[Unit]
Description=Remount API VFS
...
ExecStart=/lib/systemd/systemd-remount-api-vfs
...
And if you look at that code:
/* Goes through /etc/fstab and remounts all API file systems, applying
 * options that are in /etc/fstab that systemd might not have
 * respected */
So, it just looks like an ordinary bug. File it, we can get it fixed, and we can all live happily ever after.
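The behaviour described by that code comment can be sketched roughly as follows. This illustrates the idea only and is not systemd's implementation: the API_MOUNT_POINTS set is an assumed subset, the sample fstab is invented, and the sketch only builds remount command lines rather than running them.

```python
# Assumed subset of API mount points, for illustration only.
API_MOUNT_POINTS = {"/proc", "/sys", "/dev", "/dev/shm", "/dev/pts"}


def parse_fstab(text):
    """Yield (mount_point, options) for each non-comment fstab entry."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) >= 4:
            yield fields[1], fields[3]


def remount_commands(fstab_text):
    """Build one remount command per API file system listed in fstab."""
    return [
        ["mount", "-o", "remount," + options, mount_point]
        for mount_point, options in parse_fstab(fstab_text)
        if mount_point in API_MOUNT_POINTS
    ]


SAMPLE_FSTAB = """\
/dev/sda5 /home ext4 defaults 1 2
tmpfs /dev/shm tmpfs defaults,size=1g 0 0
# proc /proc proc defaults 0 0
"""

print(remount_commands(SAMPLE_FSTAB))
```

Only the /dev/shm entry produces a command here: /home is not an API file system and the /proc line is commented out.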
Bill
Bill Nottingham wrote on Tue, 14 Dec 2010 at 12:08 -0500:
The problem is not the technical solution. The problem is that changes to something as important as /etc/fstab are decided without Fedora developers.
Eh, what? It's a change to how API filesystems (/proc, /sys, etc.) get mounted. When this was done in rc.sysinit, every change to how it mounted /proc wasn't discussed on the devel list. When we switched to having dracut be the primary way that API filesystems are mounted, that wasn't put up to a FESCo vote.
The practical difference is that nothing broke at that time, whereas systemd tends to break things that users use. (I won't buy dismissing it as "mere bugs" - adding NOEXEC could hardly have been a typo.) Mirek
On 12/14/10 9:22 AM, Miloslav Trmač wrote:
The practical difference is that nothing broke at that time, whereas systemd tends to break things that users use. (I won't buy dismissing it as "mere bugs" - adding NOEXEC could hardly have been a typo.) Mirek
Perhaps you missed the part where the bug was that the fs doesn't get remounted with the perms from fstab as designed. That's the bug.
Let's have a little less chest pounding and a little more constructive discussion, mkay?
Jesse Keating wrote on Tue, 14 Dec 2010 at 09:47 -0800:
Perhaps you missed the part where the bug was that the fs doesn't get remounted with the perms from fstab as designed. That's the bug.
So the design was to
1) change the setting in the C reimplementation
2) add a new facility that will revert the setting to its original value
?
Is it really surprising that I'd like more discussion of the systemd design in advance? Mirek
Miloslav Trmač (mitr@volny.cz) said:
So the design was to
- change the setting in the C reimplementation
The design was to pick a default... it's actually been that way since the initial implementation and that *is* the default on some other distributions.
It probably should be relnoted, sure.
- add a new facility that will revert the setting to its original value
No, the facility is intended to apply fstab settings to any early mounted filesystem, including filesystems mounted in initramfs, etc. This is actually something that didn't exist before - for example, in earlier Fedora releases, for some filesystems you were stuck with whatever options rc.sysinit or dracut mounted them with, regardless of what's in /etc/fstab.
Bill
On Tue, 14.12.10 18:22, Miloslav Trmač (mitr@volny.cz) wrote:
The practical difference is that nothing broke at that time, whereas systemd tends to break things that users use. (I won't buy dismissing it as "mere bugs" - adding NOEXEC could hardly have been a typo.) Mirek
"tends to break"? On what is that founded? Have you filed bugs?
Lennart
Once upon a time, Bill Nottingham notting@redhat.com said:
having every system installer have to write /proc, /sys, and so on is pretty wasteful.
I've seen this said at least a couple of times. In what way is it "wasteful"? On most systems, /etc/fstab is going to be less than one filesystem block anyway, so there is absolutely zero "waste" going on.
If "waste" of a few dozen bytes is now an issue, /etc/fstab is hardly the place to start.
Chris Adams (cmadams@hiwaay.net) said:
I've seen this said at least a couple of times. In what way is it "wasteful"? On most systems, /etc/fstab is going to be less than one filesystem block anyway, so there is absolutely zero "waste" going on.
If "waste" of a few dozen bytes is now an issue, /etc/fstab is hardly the place to start.
The waste is the code in anaconda that's required to write this on every install. Then, if new filesystems are added between releases, you need to 1) patch anaconda 2) have truly gross %post scripts to edit /etc/fstab, or 3) you write code that just hardcodes the mount anyway.
And again, listing things like /sys in fstab can just give the uninitiated the idea that it's something they can change... it's *not* a configuration setting.
Bill
On Tue, 2010-12-14 at 13:48 -0500, Bill Nottingham wrote:
And again, listing things like /sys in fstab can just give the uninitiated the idea that it's something they can change... it's *not* a configuration setting.
But I want to mount my /sys over nfs! What do you mean, it's not going to work? Make it!
On Tue, 14.12.10 12:08, Bill Nottingham (notting@redhat.com) wrote:
Thanks, Bill, for replying in so much detail.
Here are a few other points:
- systemd mounts API filesystems without them needing to be in /etc/fstab. This is for a variety of reasons - having every system installer have to write /proc, /sys, and so on is pretty wasteful. It also can give inexperienced admins the idea that it's configuration that can be changed - they then rename the mount point from /proc to /processes and *kaboom*.
The main reason we mount /sys and /proc and friends in C code this early is that we simply need them ourselves. To do what systemd does, it must be able to rely that it can read process data from /proc, or device information from /sys, or cgroup information from /sys/fs/cgroup.
There is simply no way around this, and just to make a point, Upstart mounts some of those FS too, in C code (however not /dev/shm). There's very little effective difference here, and if people whine and say "things have never been done this way, you are breaking UNIX", then I can only reply that that's simply wrong.
Having said all this I actually believe that there is very little point in listing "API" file systems like these in /etc/fstab. They are after all API, hence only relevant for application code, not really useful for direct interaction or even reconfiguration by the user. Or in other words: While it definitely makes sense to mount /dev/sda5 to whatever mount point the user chooses, the mount point and options for the API file systems are pretty much an implementation detail for the OS, and there should never be a need to change them.
In order to make things secure we minimize what is allowed on the various API file systems we mount. That includes that we set noexec and similar options for the file systems involved. The interface for accessing /dev/shm is called shm_open(), and given that, there is very little reason to allow people to execute binaries from it. Of course, this is a very recent change, and at this point while we assume that this will not break any valid use of this directory, we cannot be sure about this, so we'd be very interested to learn why exactly you want the noexec to be dropped. What is your application that needs that? If there is a point in dropping the noexec, we'll definitely be willing to do so, but if the only reason would be "I always misused /dev/shm as a scratch space", then we won't be very convinced. The API for /dev/shm is shm_open(), and if you place other stuff in there, then you are misusing it and actually creating all kinds of namespacing problems (since /dev/shm is actually an all-user shared namespace), and we aren't particularly keen to support such misuses by default.
That said, because we anticipated that there are some valid uses to change the settings of these mount points (e.g. change size= for tmpfs) we actually do apply the options from /etc/fstab, if the file system is listed there. So if you really really want to misuse /dev/shm, you may. Apparently this particular feature was broken (see Bill's comments), and hence please file a bug about this.
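To picture what "applying the options from /etc/fstab" on top of the built-in defaults could mean, here is a sketch of last-option-wins merging in the style of mount(8) option lists. It is an assumption-laden illustration, not systemd's code: merge_options and the OPPOSITES table are hypothetical, and size=2g is an example value.

```python
# Pairs of options that cancel each other (illustrative, not exhaustive).
OPPOSITES = {
    "exec": "noexec", "noexec": "exec",
    "suid": "nosuid", "nosuid": "suid",
    "dev": "nodev", "nodev": "dev",
    "rw": "ro", "ro": "rw",
}


def merge_options(defaults, overrides):
    """Merge two option lists; a later option replaces an earlier conflicting one."""
    merged = {}
    for opt in list(defaults) + list(overrides):
        key, _, value = opt.partition("=")
        merged.pop(OPPOSITES.get(key, ""), None)  # drop the contradicting option
        merged.pop(key, None)                     # re-insert to keep "last wins" order
        merged[key] = value
    return [key if not value else f"{key}={value}" for key, value in merged.items()]


builtin = ["nosuid", "nodev", "noexec", "mode=1777"]  # defaults from the mount table
from_fstab = ["exec", "size=2g"]                      # what an admin might list
print(merge_options(builtin, from_fstab))
# prints ['nosuid', 'nodev', 'mode=1777', 'exec', 'size=2g']
```

With this model, listing exec for /dev/shm in /etc/fstab cancels the default noexec while the other defaults stay in place.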
So, in the long run, I believe /etc/fstab should only list real disk and network file systems, and all the API file systems should not be visible there. The list of API file systems mounted and the list of API file systems configured in this file usually has been differing anyway, and hence I would simply not list them by default there anymore at all. You could even say that this brings /etc/fstab back to its traditional roots, since the glut of virtual API file systems is actually a very recent change in history, and for the longest time /proc was really the only one ever used. So, Unix-Lovers, please say "thank you", we are bringing back to you a piece of good old Unix/Linux, that has long been taken away from you, by evil Unix-haters!
[ and again, "not listing by default" by no means means that you couldn't list them there if you wanted to, or that your options would be ignored by design -- as soon as the aforementioned bug is fixed ]
(Sorry for not responding in a more timely fashion. I have been (and still am) backpacking through India, and my access to the Internet has been only sporadic and slow.)
Lennart
On 12/14/2010 21:28, Lennart Poettering wrote:
That said, because we anticipated that there are some valid uses to change the settings of these mount points (e.g. change size= for tmpfs) we actually do apply the options from /etc/fstab, if the file system is listed there. So if you really really want to misuse /dev/shm, you may. Apparently this particular feature was broken (see Bill's comments), and hence please file a bug about this.
I'm fine with that as long as it is documented, particularly in the fstab man page and as commentary in /etc/fstab on newly-installed systems so people who read it and notice missing filesystems don't panic. Thanks for explaining your thought process.
It sounds like /tmp would be a better location to remove noexec from than /dev/shm if one needs memory-backed storage for things and doesn't want to create a new filesystem for that purpose.
On 12/14/2010 07:28 PM, Lennart Poettering wrote:
In order to make things secure we minimize what is allowed on the various API file systems we mount. That includes that we set noexec and similar options for the file systems involved. The interface for accessing /dev/shm is called shm_open(), and given that, there is very little reason to allow people to execute binaries from it. Of course, this is a very recent change, and at this point while we assume that this will not break any valid use of this directory, we cannot be sure about this, so we'd be very interested to learn why exactly you want the noexec to be dropped. What is your application that needs that? If there is a point in dropping the noexec, we'll definitely be willing to do so, but if the only reason would be "I always misused /dev/shm as a scratch space", then we won't be very convinced. The API for /dev/shm is shm_open(), and if you place other stuff in there, then you are misusing it and actually creating all kinds of namespacing problems (since /dev/shm is actually an all-user shared namespace), and we aren't particularly keen to support such misuses by default.
shm_open() takes the standard mode flags, and mmap() with PROT_EXEC on a +x fd returned by shm_open() is a legitimate operation that is required by POSIX.
This is a perfectly reasonable thing to do on a SELinux-enabled system which requires e.g. a JIT to write generated code to the writable mapping and execute that code from the executable mapping of the same shared memory object.
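The effect of noexec on this pattern can be probed from user space. The sketch below is Unix-only and makes one assumption worth stating: instead of calling shm_open() it creates a temporary file in a given directory (on Linux, glibc implements shm_open() by opening files under /dev/shm, so passing "/dev/shm" exercises the same mount). The function name can_map_executable is hypothetical.

```python
import mmap
import os
import tempfile


def can_map_executable(directory):
    """Return True if a file in directory can be mapped with PROT_EXEC."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)  # mmap needs a non-empty file
        try:
            mapping = mmap.mmap(
                fd, mmap.PAGESIZE,
                flags=mmap.MAP_SHARED,
                prot=mmap.PROT_READ | mmap.PROT_EXEC,
            )
        except PermissionError:
            # On a noexec mount the kernel refuses the mapping with EPERM.
            return False
        mapping.close()
        return True
    finally:
        os.close(fd)
        os.unlink(path)


print(can_map_executable(tempfile.gettempdir()))
```

The result depends on how the chosen directory is mounted, and possibly on security policy such as SELinux, so no particular output is implied here.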
On Tue, 14.12.10 21:11, Nicholas Miell (nmiell@gmail.com) wrote:
On 12/14/2010 07:28 PM, Lennart Poettering wrote:
In order to make things secure we minimize what is allowed on the various API file systems we mount. That includes that we set noexec and similar options for the file systems involved. The interface for accessing /dev/shm is called shm_open(), and given that this is how it is there is very little reason to allow people to execute binaries from it. Of course, this is a very recent change, and at this point while we assume that this will not break any valid use of this directory, we cannot be sure about this, so we'd be very interested to learn why exactly you want the noexec to be dropped. What is your application that needs that? If there is a point in dropping the noexec, we'll definitely be willing to do so, but if the only reason would be "I always misused /dev/shm as a scratch space", then we won't be very convinced. The API for /dev/shm is shm_open(), and if you place other stuff in there, then you are misusing it and actually creating all kinds of namespacing problems (since /dev/shm is actually an all-user shared namespace), and we aren't particularly keen to support such misuses by default.
shm_open() takes the standard mode flags, and mmap() with PROT_EXEC on a +x fd returned by shm_open() is a legitimate operation that is required by POSIX.
This is a perfectly reasonable thing to do on a SELinux-enabled system which requires e.g. a JIT to write generated code to the writable mapping and execute that code from the executable mapping of the same shared memory object.
These are good and valid points. I have now dropped noexec from the default flags for /dev/shm.
Lennart
On 12/14/2010 07:28 PM, Lennart Poettering wrote:
In order to make things secure we minimize what is allowed on the various API file systems we mount. That includes that we set noexec and similar options for the file systems involved. The interface for accessing /dev/shm is called shm_open(), and given that this is how it is there is very little reason to allow people to execute binaries from it. Of course, this is a very recent change, and at this point while we assume that this will not break any valid use of this directory, we cannot be sure about this, so we'd be very interested to learn why exactly you want the noexec to be dropped. What is your application that needs that? If there is a point in dropping the noexec, we'll definitely be willing to do so, but if the only reason would be "I always misused /dev/shm as a scratch space", then we won't be very convinced. The API for /dev/shm is shm_open(), and if you place other stuff in there, then you are misusing it and actually creating all kinds of namespacing problems (since /dev/shm is actually an all-user shared namespace), and we aren't particularly keen to support such misuses by default.
The claim "The API for /dev/shm is shm_open()" is incorrect. Very early in the history of shm [late 1970's at the Columbus, Ohio, USA branch of Bell Telephone Laboratories], shm_open, shmget, etc., were the only means of access; the objects had names that were 32-bit binary integers. In fact, when shm became more widely used there were denial-of-service attacks based on the premise that enumerating objects in shm required a 2**32 exhaustive search via shmget. As soon as /dev/shm was integrated into the filesystem, creat, open, read, write, close, lseek, execve, etc. (any filesystem API) became additional access paths. This integration began appearing by about the mid 1980's, around 25 years ago, and since then applications have been using /dev/shm via ordinary file system APIs in addition to shmget etc.
Why? Because *fast* operations on small numbers of small-to-medium-sized files can be a big advantage for performance. /tmp often is much slower because /tmp often is a harddrive: the need for space in /tmp often exceeds the size of physical RAM. Also, mounting /tmp as tmpfs can meet resistance because tmpfs does not support all features that applications expect. A ramdisk might be used, except that early ramdisks allowed at most a few megabytes (comparable to the capacity of a floppy disk), which is not large enough to support typical simultaneous usage. Applications also cannot rely on ramdisks because superuser privileges usually are required to access a ramdisk. In many cases ramdisks have been replaced by: /dev/shm !!
I have applications which use /dev/shm via file system APIs, including execve() and dlopen(). Both of those fail when /dev/shm has MS_NOEXEC. One group of applications generates database plugins on-the-fly in a just-in-time fashion. Of course non-interactive performance increases in the usual way that substituting compiled code for interpreted often gives a speedup of 8X or more. Interactive response also improves, because small files in /dev/shm do not contend with operations in /tmp which can require slow sync() or large transfers. In some cases even /dev/shm is slower than desirable. I have requested dlopen() from memory: http://sourceware.org/bugzilla/show_bug.cgi?id=11767 . Meanwhile, /dev/shm is the only choice which is present always and sufficiently fast.
It is just not true that file system APIs are a misuse of /dev/shm.
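For an application in this position, one possible approach is to probe candidate directories at startup and use the first one that is writable and not mounted noexec. The following is only a sketch under assumptions: the helper names allows_exec and pick_exec_dir are made up for illustration, and the candidate list simply repeats the two locations discussed in this thread. It relies on os.statvfs() and the os.ST_NOEXEC flag, which Python exposes on Linux.

```python
import os


def allows_exec(path):
    """Return True if the filesystem holding `path` is not mounted noexec."""
    flags = os.statvfs(path).f_flag
    return not (flags & os.ST_NOEXEC)


def pick_exec_dir(candidates):
    """Return the first writable candidate whose mount allows exec, else None."""
    for path in candidates:
        try:
            if os.access(path, os.W_OK | os.X_OK) and allows_exec(path):
                return path
        except OSError:
            # The path vanished or cannot be inspected; try the next one.
            continue
    return None


if __name__ == "__main__":
    # Candidate order is an assumption: fastest location first.
    print(pick_exec_dir(["/dev/shm", "/tmp"]))
```

This only inspects mount flags; a security policy such as SELinux may still refuse dlopen() or execve() on a directory that passes the check.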
--
Once upon a time, Tomasz Torcz tomek@pipebreaker.pl said:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
What is the advantage to making some mounts not listed in the file with all the other mounts? It isn't like /etc/fstab is a hundred lines or anything; it is a standard config file that predates Linux. All mounts are listed there until systemd decided to override it (without any warning or documentation).
On Tue, 14.12.10 08:08, Chris Adams (cmadams@hiwaay.net) wrote:
Once upon a time, Tomasz Torcz tomek@pipebreaker.pl said:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
What is the advantage to making some mounts not listed in the file with all the other mounts? It isn't like /etc/fstab is a hundred lines or anything; it is a standard config file that predates Linux. All mounts are listed there until systemd decided to override it (without any warning or documentation).
Well, what would be the advantage of listing it? Confusing the admin with lines that are an implementation detail of the OS? Or giving the admin the suggestion to maybe change the mount point of procfs to /waldo and see how everything breaks?
Also, the list in /etc/fstab never was complete anyway. It never listed /selinux, neither /sys/fs/cgroup (or its predecessor /cgroup), or /sys/kernel/security, or /dev/hugepages, or /dev/mqueue, or binfmt_misc, or /sys/kernel/debug, or the rpc_pipefs, or the fuse connections fs.
(Also, this discussion is premature anyway, since I have not asked the Anaconda team to drop the default procfs/sysfs lines from fstab, and won't do so before F16).
Lennart
On Tue, Dec 14, 2010 at 02:24:53PM +0100, Tomasz Torcz wrote:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
The issue here isn't that the reporter wanted to mount them somewhere else, but he wanted to set the default mount options to something else (or in fact to set them back to how they are now -- systemd has decided to use some other mount options entirely without consulting anyone else).
I think it's very reasonable to want to edit /etc/fstab to change the default mount options of these filesystems. Suppose that /dev/shm defaults to allowing suid and exec. At some point in the future a security problem is found which can be worked around by temporarily setting nosuid on /dev/shm (while the real issue is fixed). An administrator can't do that without recompiling systemd.
Rich.
On Tue, Dec 14, 2010 at 02:25:38PM +0000, Richard W.M. Jones wrote:
I think it's very reasonable to want to edit /etc/fstab to change the default mount options of these filesystems. Suppose that /dev/shm defaults to allowing suid and exec. At some point in the future a security problem is found which can be worked around by temporarily setting nosuid on /dev/shm (while the real issue is fixed). An administrator can't do that without recompiling systemd.
I'm not sure there's a win in having systemd do magic rather than just using fstab -- reminds me of IRIX and its auto-mounting of some but not all swap partitions. (Yay newbie admin confusion!)
But if there's a good technical reason, it still seems reasonable to let /etc/fstab override the defaults.
On Tue, Dec 14, 2010 at 02:25:38PM +0000, Richard W.M. Jones wrote:
On Tue, Dec 14, 2010 at 02:24:53PM +0100, Tomasz Torcz wrote:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
The issue here isn't that the reporter wanted to mount them somewhere else, but he wanted to set the default mount options to something else (or in fact to set them back to how they are now -- systemd has decided to use some other mount options entirely without consulting anyone else).
I think it's very reasonable to want to edit /etc/fstab to change the default mount options of these filesystems. Suppose that /dev/shm defaults to allowing suid and exec. At some point in the future a security problem is found which can be worked around by temporarily setting nosuid on /dev/shm (while the real issue is fixed). An administrator can't do that without recompiling systemd.
Of course the administrator can temporarily override it: mount /dev/shm -o remount,nosuid
Or even have it stick after reboot, by dropping the following unit definition¹ into /etc/systemd/system/:
--
[Unit]
Description=Temporary workaround for CVE-x
DefaultDependencies=false

[Service]
Type=oneshot
ExecStart=/bin/mount /dev/shm -o remount,nosuid

[Install]
WantedBy=local-fs.target
--
While I agree that hidden mounts are bad idea, they're still visible in "systemctl -t mount" and "findmnt" output.
¹ created ad-hoc to show idea, not tested
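To confirm that such a remount took effect, the options recorded in /proc/mounts can be inspected. The sketch below is an illustration only: the function name mount_options is hypothetical, and the sample line is the /dev/shm entry quoted in the original post of this thread.

```python
def mount_options(mounts_text, mount_point):
    """Return the option set of the last entry mounted at mount_point.

    mounts_text uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    Returns None if the mount point does not appear.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries shadow earlier ones, so keep the last match.
            found = set(fields[3].split(","))
    return found


# Entry quoted in the original report for Fedora 15 rawhide.
SAMPLE = "tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0\n"

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the quoted sample off Linux
    print(sorted(mount_options(text, "/dev/shm") or []))
```

On the sample entry the returned set contains both nosuid and noexec, matching the grep output in the original report.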
On Tue, 14 Dec 2010, Tomasz Torcz wrote:
Of course the administrator can temporarily override it: mount /dev/shm -o remount,nosuid
Or even have it stick after reboot, by dropping the following unit definition¹ into /etc/systemd/system/:
No.
You either follow what is in /etc/fstab, or you disallow it from /etc/fstab.
You do not ignore /etc/fstab.
And if for some bad reason you do decide to ignore /etc/fstab, this should clearly cause log entries, and there should be a clear section in "man fstab" explaining this.
Yes, documentation is not sexy. No, source code is not documentation.
Paul (yes, bitter by the horrors of 10 years of iproute2)
On Tue, 2010-12-14 at 17:54 -0500, Paul Wouters wrote:
On Tue, 14 Dec 2010, Tomasz Torcz wrote:
Of course the administrator can temporarily override it: mount /dev/shm -o remount,nosuid
Or even have it stick after reboot, by dropping the following unit definition¹ into /etc/systemd/system/:
No.
You either follow what is in /etc/fstab, or you disallow it from /etc/fstab.
You do not ignore /etc/fstab.
You appear to have missed the bit where Bill explained, twice, that systemd is not actually designed to ignore /etc/fstab, and this is just a bug.
On Tue, 14.12.10 17:54, Paul Wouters (paul@xelerance.com) wrote:
On Tue, 14 Dec 2010, Tomasz Torcz wrote:
Of course the administrator can temporarily override it: mount /dev/shm -o remount,nosuid
Or even have it stick after reboot, by dropping the following unit definition¹ into /etc/systemd/system/:
No.
You either follow what is in /etc/fstab, or you disallow it from /etc/fstab.
You do not ignore /etc/fstab.
And if for some bad reason you do decide to ignore /etc/fstab, this should clearly cause log entries, and there should be a clear section in "man fstab" explaining this.
Yes, documentation is not sexy. No, source code is not documentation.
systemd documentation is actually pretty good and mostly comprehensive. Humble as I am I would even say that it is vastly superior to the majority of all open source projects. If you want to criticise us on something, pick something else, please.
Yes, reading documentation is not sexy, but just bitching isn't reading documentation.
Lennart
On Wed, Dec 15, 2010 at 07:17:21AM +0100, Lennart Poettering wrote:
systemd documentation is actually pretty good and mostly comprehensive. Humble as I am I would even say that it is vastly superior to the majority of all open source projects. If you want to criticise us on something, pick something else, please.
I think we could pretty justifiably criticise you on not being humble at all. :)
But you're absolutely right about the documentation, which is indeed excellent. There's man pages for everything, and they're both comprehensive and sysadmin-friendly.
On Tue, 14.12.10 14:25, Richard W.M. Jones (rjones@redhat.com) wrote:
On Tue, Dec 14, 2010 at 02:24:53PM +0100, Tomasz Torcz wrote:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
The issue here isn't that the reporter wanted to mount them somewhere else, but he wanted to set the default mount options to something else (or in fact to set them back to how they are now -- systemd has decided to use some other mount options entirely without consulting anyone else).
Jeez. That's just FUD. Of course we have discussed this openly with various folks. We haven't discussed this with you, Rich, personally, but well, I'll make a note now that I'll DoS you with every single choice we make, to get your blessing.
Lennart
On Wed, Dec 15, 2010 at 07:21:25AM +0100, Lennart Poettering wrote:
Jeez. That's just FUD. Of course we have discussed this openly with various folks. We haven't discussed this with you, Rich, personally, but well, I'll make a note now that I'll DoS you with every single choice we make, to get your blessing.
What you don't understand is that you are throwing away the experience and knowledge of thousands of Unix system administrators. Almost all of them do not even read this mailing list.
Rich.
On Thu, Dec 16, 2010 at 12:27:34PM +0000, Richard W.M. Jones wrote:
On Wed, Dec 15, 2010 at 07:21:25AM +0100, Lennart Poettering wrote:
Jeez. That's just FUD. Of course we have discussed this openly with various folks. We haven't discussed this with you, Rich, personally, but well, I'll make a note now that I'll DoS you with every single choice we make, to get your blessing.
What you don't understand is that you are throwing away the experience and knowledge of thousands of Unix system administrators. Almost all of them do not even read this mailing list.
Rich.
I hate this argument.
The "experience and knowledge" claim applies to everything we could possibly change.
Change. Is. Going. To. Happen.
This is technology. Good technical professionals learn new things quickly. So to all those thousands of Unix system administrators I suggest making a purchase here:
http://www.saferacer.com/auto-racing-helmets/?cat=52
--CJD
Casey Dahlin píše v Čt 16. 12. 2010 v 11:19 -0500:
On Thu, Dec 16, 2010 at 12:27:34PM +0000, Richard W.M. Jones wrote:
What you don't understand is that you are throwing away the experience and knowledge of thousands of Unix system administrators. Almost all of them do not even read this mailing list.
Rich.
I hate this argument.
The "experience and knowledge" claim applies to everything we could possibly change.
Change. Is. Going. To. Happen.
That's too black-and-white a view. Of course there is and will be change - what would we all be doing here if there were nothing to change, after all? The thing that needs to be appreciated is that every change has _costs_ on the user-base.
I can't quickly find out good numbers on the number of server users of Fedora and Fedora-derived distributions; based on http://www.centos.org/modules/newbb/viewtopic.php?topic_id=18728&forum=1... , let's stipulate that there are 1,000,000 installations (which is almost certainly a huge understatement), with 10 servers per administrator on average, so 100,000 Linux system administrators. Better numbers would be welcome.
* You simplify existing code in a way that changes a rarely-used configuration value. Even though this "shouldn't affect anything in most cases", it nevertheless requires a release note. Say 10% of the system administrators read the release notes, and reading the release note takes 10 seconds. The code simplification just cost our userbase more than 3 working days, with nothing to show for it. Did the code simplification save the programmers 3 days, so that at least overall there was a net benefit?
* You replace a configuration file, or change its syntax, so that old knowledge and old kickstart scripts no longer apply. Say, again, that this change affects 10% of the system administrators, and that the change is fairly trivial, so reading the documentation and updating existing configuration scripts takes only 1 hour, and validating the change and the associated administrative overhead (keeping track of the change) takes 3 hours. Now the configuration file change has cost our userbase about 19 working _years_. To be worth it across the population of system administrators, the change needs to save the average system administrator 24 minutes before the configuration method changes again, or provide some other equivalent benefit. Saving the average system administrator 24 minutes is not easy (try thinking of a configuration change that would do that), and the more frequent changes of the configuration are, the more pronounced the benefits of the feature need to be.
* You replace a whole subsystem, requiring _each_ system administrator to study the new subsystem for 10 hours, and to update the existing configuration, validate it and so on, which takes 30 hours. The change has cost each member of our userbase a working week; to be worth it, it also needs to save each system administrator a working week. Again, the more frequent the subsystem changes are, the more pronounced the benefits of the changes need to be.
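The arithmetic behind these three estimates can be reproduced with a short sketch. The installation and administrator counts are the stipulated figures from above; the 8-hour working day and the 2,080-hour working year (52 weeks of 40 hours) are assumptions added here to convert totals into days and years, and the function name total_seconds is made up for illustration.

```python
# Stipulated above: 1,000,000 installations, 10 servers per administrator.
ADMINS = 1_000_000 // 10

# Conversion factors are assumptions, not figures from the thread.
HOURS_PER_WORKING_DAY = 8
HOURS_PER_WORKING_YEAR = 52 * 40


def total_seconds(percent_affected, seconds_each, admins=ADMINS):
    """Total time spent across all affected administrators, in seconds."""
    return admins * percent_affected // 100 * seconds_each


# Release note: 10% of administrators spend 10 seconds each.
release_note = total_seconds(10, 10)
release_note_days = release_note / 3600 / HOURS_PER_WORKING_DAY

# Configuration change: 10% spend 1 hour reading plus 3 hours validating.
config_change = total_seconds(10, 4 * 3600)
config_change_years = config_change / 3600 / HOURS_PER_WORKING_YEAR
break_even_minutes = config_change / ADMINS / 60

# Subsystem replacement: everyone spends 10 + 30 hours.
subsystem = total_seconds(100, 40 * 3600)
subsystem_hours_each = subsystem / ADMINS / 3600

if __name__ == "__main__":
    print(f"release note:  {release_note_days:.1f} working days")
    print(f"config change: {config_change_years:.1f} working years")
    print(f"break-even:    {break_even_minutes:.0f} minutes per administrator")
    print(f"subsystem:     {subsystem_hours_each:.0f} hours per administrator")
```

Under these assumptions the totals come out at roughly 3.5 working days, 19.2 working years, a 24-minute break-even per administrator, and 40 hours per administrator, consistent with the figures quoted in the list.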
So, yes, change is going to happen, and some change is clearly good. But when there are 10 programmers on a project and 100,000 users, each change has to be _very obviously_ good, or it might be better avoided.
Especially minor changes that don't bring any measurable benefit (perhaps making the system "cleaner" or making programmer's life more convenient) but require time from each user to adapt are better abandoned than implemented. Mirek
On Thu, 2010-12-16 at 20:16 +0100, Miloslav Trmač wrote:
Casey Dahlin píše v Čt 16. 12. 2010 v 11:19 -0500:
On Thu, Dec 16, 2010 at 12:27:34PM +0000, Richard W.M. Jones wrote:
What you don't understand is that you are throwing away the experience and knowledge of thousands of Unix system administrators. Almost all of them do not even read this mailing list.
Rich.
I hate this argument.
The "experience and knowledge" claim applies to everything we could possibly change.
Change. Is. Going. To. Happen.
That's too black-and-white a view. Of course there is and will be change - what would we all be doing here if there were nothing to change, after all? The thing that needs to be appreciated is that every change has _costs_ on the user-base.
[...]
So, yes, change is going to happen, and some change is clearly good. But when there are 10 programmers on a project and 100,000 users, each change has to be _very obviously_ good, or it might be better avoided.
Especially minor changes that don't bring any measurable benefit (perhaps making the system "cleaner" or making programmer's life more convenient) but require time from each user to adapt are better abandoned than implemented.
Looking at real costs and benefits is the right approach. But do not overlook potential benefits of making it practical to add features that will help the sysadmins or avoiding a security issue later that the sysadmins would otherwise have to scramble to fix (maybe not applicable to /dev/shm, but in general).
On Thu, Dec 16, 2010 at 08:16:53PM +0100, Miloslav Trmač wrote:
Casey Dahlin píše v Čt 16. 12. 2010 v 11:19 -0500:
On Thu, Dec 16, 2010 at 12:27:34PM +0000, Richard W.M. Jones wrote:
What you don't understand is that you are throwing away the experience and knowledge of thousands of Unix system administrators. Almost all of them do not even read this mailing list.
Rich.
I hate this argument.
The "experience and knowledge" claim applies to everything we could possibly change.
Change. Is. Going. To. Happen.
That's too black-and-white a view. Of course there is and will be change - what would we all be doing here if there were nothing to change, after all? The thing that needs to be appreciated is that every change has _costs_ on the user-base.
I think the view I was presented with was too black-and-white. Richard began with essentially "change is bad." I responded. You've really wholly replaced the argument I was reacting to. Which is a good thing. The conversation should have begun here.
I can't quickly find out good numbers on the number of server users of Fedora and Fedora-derived distributions; based on http://www.centos.org/modules/newbb/viewtopic.php?topic_id=18728&forum=1... , let's stipulate that there are 1,000,000 installations (which is almost certainly a huge understatement), with 10 servers per administrator on average, so 100,000 Linux system administrators. Better numbers would be welcome.
http://fedoraproject.org/wiki/Statistics
That's the best we have.
Especially minor changes that don't bring any measurable benefit (perhaps making the system "cleaner" or making programmer's life more convenient) but require time from each user to adapt are better abandoned than implemented. Mirek
Measurable != significant. Great programmers and architects have an instinct for something called "defect avoidance." You can't measure it, since the unit would be "number of bugs/bug-related outages and problems which never happened." Depending on your instincts on what that value might be, "cleaner" could be the single most important thing to improve in the entire distro. You can guess my own instincts on the subject.
This sort of immeasurability is everywhere in computing. It's what causes most major corporate security breaches ("well, we haven't had a security breach in a while, I guess we don't need to spend so much on a security team.") It's what spawned the desperate rationalization "all software has bugs," which is an excuse to not have to measure how well you avoid putting bugs in the code. For my part, I believe in trying to write software that can't break, even if I'm not always successful. Part of that effort is ripping off anything that's loose. If its purpose is questionable, or it's exposed in a semantically iffy way, it needs to be ripped out.
--CJD
Casey Dahlin píše v Čt 16. 12. 2010 v 15:50 -0500:
On Thu, Dec 16, 2010 at 08:16:53PM +0100, Miloslav Trmač wrote:
Especially minor changes that don't bring any measurable benefit (perhaps making the system "cleaner" or making programmer's life more convenient) but require time from each user to adapt are better abandoned than implemented. Mirek
Measurable != significant. Great programmers and architects have an instinct for something called "defect avoidance." You can't measure it, since the unit would be "number of bugs/bug-related outages and problems which never happened." Depending on your instincts on what that value might be, "cleaner" could be the single most important thing to improve in the entire distro.
The trouble is that we can't all agree on the immeasurable benefits (but we can probably agree on the existence of the measurable costs), which is why the monster threads about systemd arrive so regularly. Mirek
On Thu, 16.12.10 22:02, Miloslav Trmač (mitr@volny.cz) wrote:
Casey Dahlin píše v Čt 16. 12. 2010 v 15:50 -0500:
On Thu, Dec 16, 2010 at 08:16:53PM +0100, Miloslav Trmač wrote:
Especially minor changes that don't bring any measurable benefit (perhaps making the system "cleaner" or making programmer's life more convenient) but require time from each user to adapt are better abandoned than implemented. Mirek
Measurable != significant. Great programmers and architects have an instinct for something called "defect avoidance." You can't measure it, since the unit would be "number of bugs/bug-related outages and problems which never happened." Depending on your instincts on what that value might be, "cleaner" could be the single most important thing to improve in the entire distro.
The trouble is that we can't all agree on the immeasurable benefits (but we can probably agree on the existence of the measurable costs), which is why the monster threads about systemd arrive so regularly.
Do they?
I guess as long as they are only about whether to set noexec on /dev/shm by default then we did quite a few things right, didn't we?
Lennart
On Tue, 14 Dec 2010, Tomasz Torcz wrote:
We saw it includes /dev, /dev/shm etc. Is there any *reasonable* need to mount sysfs somewhere else than /sys? Or /dev with mode other than 755? All those directories are mounted _identically_ on every Linux distribution down here. Why pollute fstab with repeated lines on a million machines?
Because the system is meant to be changeable by people. What if 20 years ago people had hardcoded /usr and /var because they knew best? Things change over time and the Unix philosophy is to allow that.
The other thing is that options where possible should be in human readable format to make understanding and changing it easier. /etc/fstab sure beats some hardcoded binary.
You are reversing the logic. Keep the system flexible and transparent.
The less we put hardcoded inside the kernel, initrd, pivot root, dracut, linuxrc or systemd the better. It is easier to change a config line than to recompile software. Don't assume you can speak for everyone with your use cases.
Original problem could be solved by configuring some scratch tmpfs in /mnt/scratch or somewhere else.
The original problem, I think, was more "I don't understand why my fstab seems to be acting up".
The fstab file itself provides valuable documentation of implicit values. Even if I never change it, I use it.
Paul
On Tue, 14.12.10 13:53, Miloslav Trmač (mitr@volny.cz) wrote:
Changing the semantics of /etc/fstab without any consultation with fedora-devel or even notification of Fedora that something so long-standing is changing is hardly constructive either.
I can happily live with "systemd is a new, better init system" without knowing the details. I consider "systemd replaces 15% of /etc and changes semantics of another 5%" without discussing the details in advance unacceptable for the distribution as a whole, although this decision is of course FESCo's.
All these things are actually discussed very much on IRC, and systemd upstream mailing lists and similar places. Quite a few people from various distributions have been involved and whenever we feel it really is necessary to introduce a configuration file for something we don't take this decision lightly, and involve a lot of people so that we come to a solution that people from all distributions can live with. Also, in every case where we actually introduced a new configuration file we carefully made sure to provide compatibility to the previous per-distribution configuration files. For example: every distro placed system-wide locale settings in a different configuration file. After doing a survey we felt that it was most appropriate to unify this in a new configuration file /etc/locale.conf instead of declaring any of the existing solutions the new standard for systemd-based systems. However, if systemd is built on Fedora we actually do fall back to /etc/sysconfig/i18n, to provide a sane upgrade path.
I guess what I want to say is that all of this is openly discussed with input from a lot of people. Yes, we haven't discussed this on fedora-devel, but a) I am pretty sure that the subscribers of this ML wouldn't like the amount of traffic this would generate and b) this is not really fedora-specific, but something to discuss with other distros too, and c) I haven't really experienced fedora-devel as a great place to discuss technical things with constructive input, but mostly as a place where people (not all, but definitely too many) are "negative-by-default" and like implying that we 1) want to destroy Unix/Linux, 2) are idiots or 3) would decide everything behind closed doors.
Or, to turn this around: if you want to have a say, if you want to influence systemd's design, then join development upstream, or otherwise become involved. Just standing on the sidelines and expecting that we will ask you personally for your kind comments, is not going to happen.
The duty to involve yourself is on you!
Lennart
Lennart Poettering píše v St 15. 12. 2010 v 06:59 +0100:
On Tue, 14.12.10 13:53, Miloslav Trmač (mitr@volny.cz) wrote:
Changing the semantics of /etc/fstab without any consultation with fedora-devel or even notification of Fedora that something so long-standing is changing is hardly constructive either.
I can happily live with "systemd is a new, better init system" without knowing the details. I consider "systemd replaces 15% of /etc and changes semantics of another 5%" without discussing the details in advance unacceptable for the distribution as a whole, although this decision is of course FESCo's.
All these things are actually discussed very much on IRC, and systemd upstream mailing lists and similar places.
That's not what I was talking about. My point is that systemd is an "unbounded" project - looking at a system feature, I don't know whether it is in scope or out of scope to be rewritten by systemd.
From the Fedora feature page: "systemd is a replacement for SysVinit and Upstart that acts as a system and session manager.". Based on this description, who would expect systemd to:
* obsolete "crontabs" package
* add yet another file identifying the distribution to /etc
* introduce a new mechanism for setting the default system locale, keyboard and font layout
* manage temporary directories?
It seems that "systemd is a project to replace existing distribution-specific infrastructure by a new, different infrastructure, with sometimes different defaults and mostly different primary configuration files" would be a more fitting description - but it leaves me guessing about the scope as well.
Or, to turn this around: if you want to have a say, if you want to influence systemd's design, then join development upstream, or otherwise become involved.
I don't think reviewing each commit and saying "find the subject matter experts and get their sign-off before releasing this" about 10% of them, which is all I really want to say, can really count as a contribution. (Given the unbounded scope of systemd, the burden of identifying and involving the subject matter experts is necessarily on the systemd project because others can't _know_ they should be involved.) Mirek
On Sun, 12.12.10 19:49, John Reiser (jreiser@bitwagon.com) wrote:
The project is a database system that creates and dlopen()s plugins on-the-fly, for better performance on ["long-running"] queries. We like the speed of creat+write+close+open+read+mmap on /dev/shm. If /dev/shm and /tmp both become off limits, then what is the recommended replacement location?
The API for /dev/shm is shm_open(). Unless you are using that API you shouldn't really touch /dev/shm.
What's wrong with /tmp for your use cases?
Lennart
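For readers who have not used the shm_open() route mentioned above, here is a minimal sketch in Python. It assumes Python 3.8 or later, whose multiprocessing.shared_memory module creates POSIX shared memory objects on Unix. The function name shm_roundtrip is made up for illustration. On Linux the object normally shows up as a file under /dev/shm, but the sketch only reports that path when it actually sees it.

```python
import os
from multiprocessing import shared_memory


def shm_roundtrip(payload: bytes):
    """Create a POSIX shared memory object, write payload, read it back.

    Returns (data_read_back, backing_path). backing_path is the file under
    /dev/shm that backs the object, or None if no such file is visible.
    payload must be non-empty because a zero-size object cannot be created.
    """
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        shm.buf[:len(payload)] = payload
        data = bytes(shm.buf[:len(payload)])
        candidate = os.path.join("/dev/shm", shm.name)
        path = candidate if os.path.exists(candidate) else None
        return data, path
    finally:
        shm.close()    # drop this process's mapping
        shm.unlink()   # remove the object itself
```

Note that nothing in this path needs exec permission on the mount, which is why a noexec /dev/shm does not affect ordinary shared memory users.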
On 12/14/2010 09:37 PM, Lennart Poettering wrote:
On Sun, 12.12.10 19:49, John Reiser (jreiser@bitwagon.com) wrote:
The project is a database system that creates and dlopen()s plugins on-the-fly, for better performance on ["long-running"] queries. We like the speed of creat+write+close+open+read+mmap on /dev/shm. If /dev/shm and /tmp both become off limits, then what is the recommended replacement location?
The API for /dev/shm is shm_open(). Unless you are using that API you shouldn't really touch /dev/shm.
What's wrong with /tmp for your use cases?
As I wrote in another place under this topic (at http://lists.fedoraproject.org/pipermail/devel/2010-December/147159.html which crossed in the posting mail), some applications prefer to avoid /tmp for such purposes because /tmp often is too slow: a real hard drive (needs capacity larger than RAM) with a heavy-weight file system (to provide full-featured ACLs, etc.) which often suffers contention.
Also, the claim "The API for /dev/shm is shm_open()" is incorrect. See the other message for the history. When something is in the file system, then by default the file system APIs (including creat, open, read, write, close, execve, dlopen, ...) are legitimate uses. (Originally [System V] shared memory was *not* in the file system, and this caused problems.)
--
On Tue, 14.12.10 22:19, John Reiser (jreiser@bitwagon.com) wrote:
Also, the claim "The API for /dev/shm is shm_open()" is incorrect. See the other message for the history. When something is in the file system, then by default the file system APIs (including creat, open, read, write, close, execve, dlopen, ...) are legitimate uses. (Originally [System V] shared memory was *not* in the file system, and this caused problems.)
Don't conflate SysV and POSIX shared memory. They are completely orthogonal. SysV shared memory does not appear in /dev/shm.
Lennart
On Tue, Dec 14, 2010 at 10:19:38PM -0800, John Reiser wrote:
Also, the claim "The API for /dev/shm is shm_open()" is incorrect. See the other message for the history. When something is in the file system, then by default the file system APIs (including creat, open, read, write, close, execve, dlopen, ...) are legitimate uses. (Originally [System V] shared memory was *not* in the file system, and this caused problems.)
I think you're confusing two things here. POSIX shared memory objects are implemented on Linux using a tmpfs filesystem mounted at /dev/shm.
I don't think there's a particularly good reason to use that filesystem for other uses. Just mount another tmpfs elsewhere.
On 12/15/2010 06:40 AM, Matthew Miller wrote:
On Tue, Dec 14, 2010 at 10:19:38PM -0800, John Reiser wrote:
Also, the claim "The API for /dev/shm is shm_open()" is incorrect. See the other message for the history. When something is in the file system, then by default the file system APIs (including creat, open, read, write, close, execve, dlopen, ...) are legitimate uses. (Originally [System V] shared memory was *not* in the file system, and this caused problems.)
I think you're confusing two things here. POSIX shared memory objects are implemented on Linux using a tmpfs filesystem mounted at /dev/shm.
A file system usually supports creat, open, read, write, getdents, execve, mmap(,,PROT_EXEC,,,), etc., and should expect those calls to be used by any process that has access permissions. It's quite hard and cumbersome to manipulate and administer shared memory objects using only shm*() routines and without file system facilities such as directories, file names, ownership, access permissions, etc., as illustrated by the history of System V shared memory objects.
I don't think there's a particularly good reason to use that filesystem for other uses. Just mount another tmpfs elsewhere.
mount() requires CAP_SYS_ADMIN and therefore an application cannot rely on performing mounts. A major point of this thread is that an application wants to rely on using a file system that is present on all boxes, can be accessed without special permissions or capabilities, and offers very fast performance for small numbers of small-to-medium-sized files. /dev/shm was the best choice until systemd in Fedora 15 rawhide mounted /dev/shm with MS_NOEXEC. Even the preview edition of Ubuntu 11.04 omits the MS_NOEXEC.
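An unprivileged application cannot change the mount flags, but it can at least detect them before trying to dlopen() or execve() something. The following is a rough sketch, not code from any project in this thread. The helper names mount_options and allows_exec are invented. The first parses text in /proc/mounts format; the second asks the kernel directly and relies on os.ST_NOEXEC, which Python only provides on Linux.

```python
import os


def mount_options(mounts_text: str, mount_point: str):
    """Return the option set for mount_point from /proc/mounts-style text.

    The last matching line wins, because later mounts shadow earlier ones.
    Returns None when the mount point is not listed.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
    return options


def allows_exec(path: str) -> bool:
    """Return True if the filesystem holding path is not mounted noexec."""
    return not (os.statvfs(path).f_flag & os.ST_NOEXEC)
```

A caller would typically pass open("/proc/mounts").read() as mounts_text, or call allows_exec() on each candidate directory and pick the first one that qualifies.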
On Wed, 15.12.10 08:44, John Reiser (jreiser@bitwagon.com) wrote:
I don't think there's a particularly good reason to use that filesystem for other uses. Just mount another tmpfs elsewhere.
mount() requires CAP_SYS_ADMIN and therefore an application cannot rely on performing mounts. A major point of this thread is that an application wants to rely on using a file system that is present on all boxes, can be accessed without special permissions or capabilities, and offers very fast performance for small numbers of small-to-medium-sized files. /dev/shm was the best choice until systemd in Fedora 15 rawhide mounted /dev/shm with MS_NOEXEC. Even the preview edition of Ubuntu 11.04 omits the MS_NOEXEC.
The appropriate place for this is /tmp. /dev/shm always has been an implementation detail of shm_open(), and accessing it directly makes only sense for admins and low-level system tools that are used to manage shared memory areas. But it is not a generic place to dump arbitrary stuff.
The FHS does not offer file systems for all thinkable uses. However, most distributions have subsequently updated its semantics, and depending on your application you might find a more suitable place to place your files. For example, /var/run sounds like it is something you might want to use (if your code is privileged at least), and on more recent distros (including F15) it is actually a tmpfs. Its purpose is that it is used for small files, sockets, fifos that are used for communication between processes and life-cycle management. Newer versions of the XDG basedir spec also specify $XDG_RUNTIME_DIR which offers similar semantics for unprivileged code and whose lifetime is strictly bound to the user being logged in. This is first properly implemented in systemd, and hence will be available properly in F15, too. It too is tmpfs (since it is actually located beneath /var/run).
So, in summary: there are places which might be more appropriate for your small files, but /dev/shm is not a good place for it, and never has been.
Lennart
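As an illustration of the $XDG_RUNTIME_DIR suggestion, here is a small sketch of how unprivileged code might choose its directory. It is an assumption-laden example, not a reference implementation: pick_runtime_dir is an invented name, and the fallback to a fresh private directory is my own choice. The ownership and 0700 checks follow what the XDG base directory specification requires of that directory.

```python
import os
import stat
import tempfile


def pick_runtime_dir(environ=os.environ) -> str:
    """Prefer $XDG_RUNTIME_DIR when it is a directory we own with mode 0700;
    otherwise fall back to a fresh private directory under the temp dir."""
    candidate = environ.get("XDG_RUNTIME_DIR")
    if candidate:
        try:
            st = os.stat(candidate)
        except OSError:
            st = None
        if (st is not None
                and stat.S_ISDIR(st.st_mode)
                and st.st_uid == os.getuid()
                and stat.S_IMODE(st.st_mode) == 0o700):
            return candidate
    # mkdtemp creates the directory with mode 0700, owned by the caller.
    return tempfile.mkdtemp(prefix="runtime-")
```

Whether the chosen directory is RAM-backed or mounted noexec still depends on the system, so an application with needs like those described in this thread would have to check that separately.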
On 12/14/2010 09:37 PM, Lennart Poettering wrote:
On Sun, 12.12.10 19:49, John Reiser (jreiser@bitwagon.com) wrote:
The project is a database system that creates and dlopen()s plugins on-the-fly, for better performance on ["long-running"] queries. We like the speed of creat+write+close+open+read+mmap on /dev/shm. If /dev/shm and /tmp both become off limits, then what is the recommended replacement location?
The API for /dev/shm is shm_open(). Unless you are using that API you shouldn't really touch /dev/shm.
What's wrong with /tmp for your use cases?
[sorry to be late for this thread, I understand the original message should be treated as a bug of systemd not reapplying stuff from fstab after it was done with its own internal needs]
I would like to bring to the attention of the list another current usage of the tmpfs mounted on /dev/shm in Fedora packages:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
I raise this issue because "The API for /dev/shm is shm_open()" statement above means to me that in the future there will be no file api access to a ram mounted filesystem in Fedora (I understand that this is my own conclusion, but I can't see any other given the wording of the statement above). Before someone implements that idea, please consider the needs of a filesystem in ram for such uses as those mentioned in this thread (and that is supported by the Fedora distribution by default). Just in case...
-- Fernando
On Mon, 2010-12-20 at 13:07 -0800, Fernando Lopez-Lezcano wrote:
I would like to bring to the attention of the list another current usage of the tmpfs mounted on /dev/shm in Fedora packages:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
This is right and wrong.
JACK uses /dev/shm for two purposes on Linux [1]. The first is as the definition of what its configure script calls HOST_DEFAULT_TMP_DIR. This path is only used as a name to which to attach the jack sockets. The extent to which this will _ever_ touch the disk, even on a journaled filesystem, is:
- eventually, the inode for that socket and the dnode for the containing directory will have to be written to the disk, once.
- under memory pressure the vfs may decide to throw away the inode cache for that socket, which would then have to be re-read from disk for subsequent connecting JACK clients.
In other words, these are setup costs, not maintenance costs. This may cause glitches in a realtime scenario to the extent that clients are created and destroyed, but in general I submit that the cost of exec() of those new clients is going to dwarf the cost of the inode cache miss for the JACK socket. [2]
The other usage of /dev/shm is for actual shared memory segments, but the shm layer in jack uses shm_open() and friends, so the use of /dev/shm is simply glibc's implementation detail.
[1] - I have read the JACK source literally once in my life (ie, just now), and I do not claim to be an expert, but this is all I was able to find.
[2] - Though, should someone feel especially enterprising, it would probably be a worthwhile optimization to tweak the inode cache replacement to prefer dropping regular files to sockets, on the grounds that IPC should not be a disk operation. If it doesn't already; I haven't looked.
- ajax
On 12/20/2010 02:17 PM, Adam Jackson wrote:
On Mon, 2010-12-20 at 13:07 -0800, Fernando Lopez-Lezcano wrote:
I would like to bring to the attention of the list another current usage of the tmpfs mounted on /dev/shm in Fedora packages:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
This is right and wrong.
Right! Thanks very much for looking at this in such detail (I presume you looked at the 1.9.6 code base?).
JACK uses /dev/shm for two purposes on Linux [1]. The first is as the definition of what its configure script calls HOST_DEFAULT_TMP_DIR. This path is only used as a name to which to attach the jack sockets. The extent to which this will _ever_ touch the disk, even on a journaled filesystem, is:
- eventually, the inode for that socket and the dnode for the containing directory will have to be written to the disk, once.
- under memory pressure the vfs may decide to throw away the inode cache for that socket, which would then have to be re-read from disk for subsequent connecting JACK clients.
In other words, these are setup costs, not maintenance costs. This may cause glitches in a realtime scenario to the extent that clients are created and destroyed, but in general I submit that the cost of exec() of those new clients is going to dwarf the cost of the inode cache miss for the JACK socket. [2]
My experience (caveat: a long time ago, maybe everything has changed internally in both jack and the kernel and that has invalidated my experience cache :-) was that using /tmp would lead to constant - not all the time, but very frequent and not correlated with client connection/disconnection - xruns (glitches in the audio), using /dev/shm would fix that immediately. That was why things were moved over to /dev/shm if I remember correctly.
The other usage of /dev/shm is for actual shared memory segments, but the shm layer in jack uses shm_open() and friends, so the use of /dev/shm is simply glibc's implementation detail.
Thanks, -- Fernando
[1] - I have read the JACK source literally once in my life (ie, just now), and I do not claim to be an expert, but this is all I was able to find.
[2] - Though, should someone feel especially enterprising, it would probably be a worthwhile optimization to tweak the inode cache replacement to prefer dropping regular files to sockets, on the grounds that IPC should not be a disk operation. If it doesn't already; I haven't looked.
On 12/20/2010 05:26 PM, Fernando Lopez-Lezcano wrote:
On 12/20/2010 02:17 PM, Adam Jackson wrote:
On Mon, 2010-12-20 at 13:07 -0800, Fernando Lopez-Lezcano wrote:
I would like to bring to the attention of the list another current usage of the tmpfs mounted on /dev/shm in Fedora packages:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
This is right and wrong.
Right! Thanks very much for looking at this in such detail (I presume you looked at the 1.9.6 code base?).
This is from Paul Davis, the main architect of Jack (I forwarded him your post):
---- this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
(b) FIFOs for RT wakeups (this could use semaphores now)
(c) shared memory created via either the sysv or posix shm API
we don't care about the unix domain sockets' performance characteristics, but its convenient to have them in a known location that happens to be close to where (b) is located.
we do care about the performance of (b)
(c) just works. ----
-- Fernando
On Mon, Dec 20, 2010 at 07:16:21PM -0800, Fernando Lopez-Lezcano wrote:
This is from Paul Davis, the main architect of Jack (I forwarded him your post):
this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
Perhaps these could become abstract domain sockets.
--CJD
On 12/22/2010 12:56 PM, Casey Dahlin wrote:
On Mon, Dec 20, 2010 at 07:16:21PM -0800, Fernando Lopez-Lezcano wrote:
This is from Paul Davis, the main architect of Jack (I forwarded him your post):
this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
Perhaps these could become abstract domain sockets.
Could you explain a bit perhaps? I'm not familiar with them... (or maybe you have a url I could surf to?)
Anyway, the main concern re: the subject of this thread is:
(b) FIFOs for RT wakeups (this could use semaphores now)
we do care about the performance of (b)
These have to be very fast as they are used for waking up the next client in the round robin transfer of control between jackd and its clients (and that's why they are in /dev/shm).
-- Fernando
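For readers unfamiliar with the FIFO wakeup pattern in (b), here is a minimal single-process sketch. It is not taken from the JACK sources, and fifo_wakeup_demo and its directory parameter are invented for illustration. Only the FIFO's name lives in the filesystem; the byte that is written travels through a kernel pipe buffer.

```python
import os
import tempfile


def fifo_wakeup_demo(directory=None) -> bytes:
    """Create a FIFO, post one wakeup byte, and consume it.

    directory is where the FIFO's name is created; pass a RAM-backed
    location if path lookups must never wait on a disk.
    """
    workdir = tempfile.mkdtemp(dir=directory)
    path = os.path.join(workdir, "wakeup")
    os.mkfifo(path, 0o600)
    # Open the read end first: a non-blocking read-only open returns at
    # once, and it lets the non-blocking write-only open below succeed.
    rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(wfd, b"\x01")   # signal the waiting side
        return os.read(rfd, 1)   # the waiter consumes the token
    finally:
        os.close(wfd)
        os.close(rfd)
        os.unlink(path)
        os.rmdir(workdir)
```

In a real client/server pair the two ends would be opened by different processes and the reader would block in read() or poll() until the token arrives.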
On Thu, 2010-12-23 at 09:11 -0800, Fernando Lopez-Lezcano wrote:
On 12/22/2010 12:56 PM, Casey Dahlin wrote:
On Mon, Dec 20, 2010 at 07:16:21PM -0800, Fernando Lopez-Lezcano wrote:
(a) unix-domain sockets for non-RT communication with the server
Perhaps these could become abstract domain sockets.
Could you explain a bit perhaps? I'm not familiar with them... (or maybe you have a url I could surf to?)
See the unix(7) man page. In /proc/net/unix, the abstract-namespace sockets are listed starting with an @ sign.
On Thu, Dec 23, 2010 at 09:11:46AM -0800, Fernando Lopez-Lezcano wrote:
On 12/22/2010 12:56 PM, Casey Dahlin wrote:
On Mon, Dec 20, 2010 at 07:16:21PM -0800, Fernando Lopez-Lezcano wrote:
This is from Paul Davis, the main architect of Jack (I forwarded him your post):
this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
Perhaps these could become abstract domain sockets.
Could you explain a bit perhaps? I'm not familiar with them... (or maybe you have a url I could surf to?)
Basically, you put a \0 in front of the path when you bind the socket. So, for example, bind to "\0/jack/socket". Yes, that looks weird, but it works. The socket will not appear anywhere in the filesystem, but can still be opened by using that wonky path from anywhere. When no longer referenced the socket will simply disappear.
Here's a link, though it takes a while to get to the point: http://blog.eduardofleury.com/archives/2007/09/13/
--CJD
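The leading-NUL trick can be tried out with a few lines of Python. This is only a sketch, assuming Linux (the abstract namespace is a Linux extension); abstract_echo is an invented name, and the socket name is whatever the caller passes in.

```python
import socket


def abstract_echo(name: bytes, message: bytes) -> bytes:
    """Bind a listener in the Linux abstract namespace, connect to it,
    and pass one short message through.

    The leading NUL byte marks the name as abstract, so no file is
    created anywhere in the filesystem.
    """
    address = b"\0" + name
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(address)
        server.listen(1)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(address)
            conn, _ = server.accept()
            with conn:
                client.sendall(message)
                return conn.recv(len(message))
```

Once both sockets are closed the name disappears on its own; there is nothing to unlink.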
On Sat, 25.12.10 11:51, Casey Dahlin (cdahlin@redhat.com) wrote:
Could you explain a bit perhaps? I'm not familiar with them... (or maybe you have a url I could surf to?)
Basically, you put a \0 in front of the path when you bind the socket. So, for example, bind to "\0/jack/socket". Yes, that looks weird, but it works. The socket will not appear anywhere in the filesystem, but can still be opened by using that wonky path from anywhere. When no longer referenced the socket will simply disappear.
Here's a link, though it takes a while to get to the point: http://blog.eduardofleury.com/archives/2007/09/13/
BTW: I can only ask everybody to be very careful with abstract namespace sockets, since there is no access control applied to the namespace: everybody can allocate any socket. If jack would hardcode the socket it uses to \0/org/jack/socket or so, then a) only one user could run jack at a time, b) an evil user could simply allocate that socket and thus ensure that nobody else can run jack anymore (DoS) and c) jack clients of other users might try to connect to a jack instance belonging to one user, which might create confusion and errors.
If you place a socket in a dir such as $XDG_RUNTIME_DIR these problems don't exist, since that dir belongs to the user, and only the user, so nobody else can allocate sockets in it or connect to it, which fixes the problems pointed out above.
Or to turn this around: abstract namespace sockets are only safe to use if they:
a) use a randomized name (which makes them a lot less useful, since you need to add some additional logic to find out what name they have for your application)
or
b) use a fixed name, but only by a system daemon that is started early at boot (i.e. at a time where no evil user could be logged in) and is never restarted (so that no time window exists where the socket is unallocated during normal runtime that evil users could take advantage of).
That basically means that besides systemd itself and maybe the D-Bus system bus almost nobody can safely use fixed name abstract namespace sockets. In particular user code that uses fixed name abstract namespace sockets is necessarily vulnerable to DoS attacks.
Yes, abstract namespace sockets only have a very limited use.
Lennart
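The name-squatting problem described above is easy to reproduce. The sketch below assumes Linux; try_bind_abstract is an invented helper, and the name passed to it stands in for whatever fixed name a service would hardcode. Whoever binds the name first wins, and every later bind fails with EADDRINUSE no matter which user issues it.

```python
import errno
import socket


def try_bind_abstract(name: bytes):
    """Return a socket bound to the abstract name, or None if it is taken."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(b"\0" + name)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None   # somebody else already holds the name
        raise
    return sock
```

If one process holds the socket returned for a given name, a second call with the same name returns None until the holder closes it. That window after a close or crash is exactly where another user can step in.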
On Sat, 2010-12-25 at 19:37 +0100, Lennart Poettering wrote:
That basically means that besides systemd itself and maybe the D-Bus system bus almost nobody can safely use fixed name abstract namespace sockets. In particular user code that uses fixed name abstract namespace sockets is necessarily vulnerable to DoS attacks.
Yes, abstract namespace sockets only have a very limited use.
On my desktop, abstract namespace sockets are twice as popular as the regular ones:
bernie@giskard:~$ netstat -ax | grep @ | wc -l
151
bernie@giskard:~$ netstat -ax | grep -v @ | grep / | wc -l
73
Most uses are from dbus, but I'm also seeing gnome-session and gvfsd-trash.
On Mon, 03.01.11 22:12, Bernie Innocenti (bernie@codewiz.org) wrote:
On my desktop, abstract namespace sockets are twice as popular as the regular ones:
bernie@giskard:~$ netstat -ax | grep @ | wc -l
151
bernie@giskard:~$ netstat -ax | grep -v @ | grep / | wc -l
73
Most uses are from dbus, but I'm also seeing gnome-session and gvfsd-trash.
Of these being used, dbus is correctly implemented, since it randomizes the socket name. Same for gdm.
Misusing are ICE, X11, nspluginwrapper at least, since they do not use a random socket name but a fixed one, hence opening the door to DoS attacks.
Lennart
On Tue, 2011-01-04 at 14:11 +0100, Lennart Poettering wrote:
Misusing are ICE, X11, nspluginwrapper at least, since they do not use a random socket name but a fixed one, hence opening the door to DoS attacks.
X's socket name isn't fixed. It's a function of whatever display name you asked for when you launched the server. Our filesystem-bound socket name is not different in this respect.
- ajax
On Tue, 04.01.11 17:36, Adam Jackson (ajax@redhat.com) wrote:
On Tue, 2011-01-04 at 14:11 +0100, Lennart Poettering wrote:
Misusing are ICE, X11, nspluginwrapper at least, since they do not use a random socket name but a fixed one, hence opening the door to DoS attacks.
X's socket name isn't fixed. It's a function of whatever display name you asked for when you launched the server. Our filesystem-bound socket name is not different in this respect.
Well, OK, bad wording on my side. Replace "fixed" by "guessable".
Lennart
On Wed, 2011-01-05 at 00:59 +0100, Lennart Poettering wrote:
Well, OK, bad wording on my side. Replace "fixed" by "guessable".
What sort of attack would this enable?
Wait... any unprivileged process can create sockets in the abstract namespace? Uh-oh.
On Tue, Jan 4, 2011 at 4:31 PM, Bernie Innocenti <bernie@codewiz.org> wrote:
What sort of attack would this enable?
Wait... any unprivileged process can create sockets in the abstract namespace? Uh-oh.
Any unprivileged process can prevent you from running X on a given display by using up the socket name that X wants to use. This is a textbook DOS scenario.
On Tue, Jan 04, 2011 at 05:42:12PM -0800, Garrett Holmstrom wrote:
On Tue, Jan 4, 2011 at 4:31 PM, Bernie Innocenti bernie@codewiz.org wrote:
What sort of attack would this enable?
Wait... any unprivileged process can create sockets in the abstract namespace? Uh-oh.
Any unprivileged process can prevent you from running X on a given display by using up the socket name that X wants to use. This is a textbook DOS scenario.
If we have private /tmp this problem would go away.
Rich.
On Fri, 2011-01-07 at 11:46 +0000, Richard W.M. Jones wrote:
On Tue, Jan 04, 2011 at 05:42:12PM -0800, Garrett Holmstrom wrote:
On Tue, Jan 4, 2011 at 4:31 PM, Bernie Innocenti bernie@codewiz.org wrote:
What sort of attack would this enable?
Wait... any unprivileged process can create sockets in the abstract namespace? Uh-oh.
Any unprivileged process can prevent you from running X on a given display by using up the socket name that X wants to use. This is a textbook DOS scenario.
If we have private /tmp this problem would go away.
If we had private /tmp this would not go away, because the user starting the X server is not always the user whose session it belongs to. Putting the socket in gdm's /tmp means it won't be someplace where rjones can get to it.
Also because multiple users on the same display is a completely valid use case that people actually do.
- ajax
On Tue, 2011-01-04 at 14:11 +0100, Lennart Poettering wrote:
Of these being used, dbus is correctly implemented, since it randomizes the socket name. Same for gdm.
The relevant point is not randomness or unguessability, but that dbus chooses an available name and passes the actual name being used to clients (via the DBUS_SESSION_BUS_ADDRESS environment variable).
However, even this may not be enough if the session dbus-daemon dies for any reason and an attacker takes over the name and sends malicious responses. It would be preferable if process death cases (the OOM-killer, even) did not automatically become security holes. I'm not sure how best to solve this. Wean ourselves from the convenience of the abstract namespace and go back to filesystem sockets in places only writable by appropriate parties?
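The "choose an available name and tell the clients" approach might look roughly like the following. This is a sketch assuming Linux, not dbus-daemon's actual code: bind_fresh_abstract and the prefix are invented, and only the "unix:abstract=" address form is borrowed from D-Bus.

```python
import errno
import secrets
import socket


def bind_fresh_abstract(prefix: bytes, attempts: int = 16):
    """Bind to an unused abstract name and return (socket, address string).

    A taken name is not an error here; the function simply tries another.
    """
    for _ in range(attempts):
        name = prefix + secrets.token_hex(8).encode()
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.bind(b"\0" + name)
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise
            continue
        return sock, "unix:abstract=" + name.decode()
    raise RuntimeError("no free abstract socket name found")
```

The returned address string would then be exported to child processes through an environment variable, the way DBUS_SESSION_BUS_ADDRESS is used. As the message above points out, this does not help if the server later dies and someone else binds the same name.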
On Tue, 04.01.11 21:31, Matt McCutchen (matt@mattmccutchen.net) wrote:
On Tue, 2011-01-04 at 14:11 +0100, Lennart Poettering wrote:
Of these being used, dbus is correctly implemented, since it randomizes the socket name. Same for gdm.
The relevant point is not randomness or unguessability, but that dbus chooses an available name and passes the actual name being used to clients (via the DBUS_SESSION_BUS_ADDRESS environment variable).
However, even this may not be enough if the session dbus-daemon dies for any reason and an attacker takes over the name and sends malicious responses. It would be preferable if process death cases (the OOM-killer, even) did not automatically become security holes. I'm not sure how best to solve this. Wean ourselves from the convenience of the abstract namespace and go back to filesystem sockets in places only writable by appropriate parties?
That's precisely what I want to tell people: don't use the abstract socket namespace, unless you really know what you do. The only cases where it really makes sense to use it is if you have a privileged service that is started before any user code and never goes away and hence is not vulnerable to these problems. The D-Bus system bus, the init systemd and udev are probably the only ones really qualifying for that. Everything else is restartable.
Lennart
On Wed, 2011-01-05 at 13:52 +0100, Lennart Poettering wrote:
On Tue, 04.01.11 21:31, Matt McCutchen (matt@mattmccutchen.net) wrote:
On Tue, 2011-01-04 at 14:11 +0100, Lennart Poettering wrote:
Of these being used, dbus is correctly implemented, since it randomizes the socket name. Same for gdm.
The relevant point is not randomness or unguessability, but that dbus chooses an available name and passes the actual name being used to clients (via the DBUS_SESSION_BUS_ADDRESS environment variable).
However, even this may not be enough if the session dbus-daemon dies for any reason and an attacker takes over the name and sends malicious responses. It would be preferable if process death cases (the OOM-killer, even) did not automatically become security holes. I'm not sure how best to solve this. Wean ourselves from the convenience of the abstract namespace and go back to filesystem sockets in places only writable by appropriate parties?
That's precisely what I want to tell people: don't use the abstract socket namespace, unless you really know what you do. The only cases where it really makes sense to use it is if you have a privileged service that is started before any user code and never goes away and hence is not vulnerable to these problems.
But as I said, it's impossible to guarantee that the service never goes away. It could crash or get OOM-killed (or terminate before all potential clients have terminated during system shutdown: is that possible?), and then you have a security hole. So I would recommend filesystem sockets for everything.
On Wed, 05.01.11 09:39, Matt McCutchen (matt@mattmccutchen.net) wrote:
That's precisely what I want to tell people: don't use the abstract socket namespace, unless you really know what you do. The only cases where it really makes sense to use it is if you have a privileged service that is started before any user code and never goes away and hence is not vulnerable to these problems.
But as I said, it's impossible to guarantee that the service never goes away. It could crash or get OOM-killed (or terminate before all potential clients have terminated during system shutdown: is that possible?), and then you have a security hole. So I would recommend filesystem sockets for everything.
Well, if PID 1 terminates the kernel halts the system. And udev fiddles with its OOM score to avoid being killed. And if the dbus system bus goes away the system becomes kinda unusable too.
These three services are kinda essential, if they go away the system is dead. And given that this is how it is, these three are most likely the only ones where it is safe that they use abstract namespace sockets.
Lennart
On Wed, 2011-01-05 at 16:35 +0100, Lennart Poettering wrote:
On Wed, 05.01.11 09:39, Matt McCutchen (matt@mattmccutchen.net) wrote:
That's precisely what I want to tell people: don't use the abstract socket namespace, unless you really know what you do. The only cases where it really makes sense to use it is if you have a privileged service that is started before any user code and never goes away and hence is not vulnerable to these problems.
But as I said, it's impossible to guarantee that the service never goes away. It could crash or get OOM-killed (or terminate before all potential clients have terminated during system shutdown: is that possible?), and then you have a security hole. So I would recommend filesystem sockets for everything.
Well, if PID 1 terminates the kernel halts the system.
Valid point.
And udev fiddles with its OOM score to avoid being killed.
There could still be a bug that causes udev to crash. As a general principle, systems should fail secure.
And if the dbus system bus goes away the system becomes kinda unusable too.
Whether system features break for legitimate users is beside the point. As long as user applications are still running, they may connect to the system bus and be tricked into doing something harmful by an attacker who impersonates it.
On Wed, 2011-01-05 at 13:52 +0100, Lennart Poettering wrote:
That's precisely what I want to tell people: don't use the abstract socket namespace, unless you really know what you do. The only cases where it really makes sense to use it is if you have a privileged service that is started before any user code and never goes away and hence is not vulnerable to these problems. The D-Bus system bus, the init systemd and udev are probably the only ones really qualifying for that. Everything else is restartable.
Fedora's X has a patch [1] (which I'm almost certain has been posted upstream, and certainly sounded like it had approval at the most recent XDS when it came up) where the X server will simply _pick_ a (set of) display socket(s) not already bound, and tell you what it picked on a file descriptor you pass in from the launching process. Which neatly avoids this kind of DoS, and also eliminates the failure case in gdm that causes the display seizure of doom when X fails to start for other reasons. Right now the only thing using this is the selinux X sandbox, but it's certainly generally applicable.
The deeper problem is that clients authenticate themselves to the server, but then simply trust that the server is the server they were hoping for. If you don't have a process tree relationship (like the gdm +displayfd case) then you have to go all the way to something like Kerberos for that kind of bidirectional auth. Simply moving back to filesystem sockets does not solve that - and indeed, has _more_ DoS conditions than abstract sockets since they don't get garbage-collected on system crash - so simply proscribing the use of abstract sockets seems a little harsh.
(And of course what we're doing here is protecting against a malicious attacker who already has enough privileges to run code on your system, which means you're pretty far into having already lost. Meh.)
[1] - http://pkgs.fedoraproject.org/gitweb/?p=xorg-x11-server.git;a=blob_plain;f=x...
- ajax
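The "server picks, then reports on a file descriptor" idea can be sketched independently of X. The code below is a toy model and not the X server patch: pick_free_display and read_reported_display are invented names, and the taken set stands in for the real test of which display sockets are already bound.

```python
import os


def pick_free_display(taken, report_fd: int, limit: int = 64) -> int:
    """Server side: choose the lowest number not in taken and write it,
    newline-terminated, to the descriptor the launcher passed in."""
    for number in range(limit):
        if number not in taken:
            os.write(report_fd, f"{number}\n".encode())
            return number
    raise RuntimeError("no free display number")


def read_reported_display(read_fd: int) -> int:
    """Launcher side: read the newline-terminated number the server chose."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = os.read(read_fd, 16)
        if not chunk:          # writer closed without finishing the line
            break
        data += chunk
    return int(data.decode().strip())
```

Because the launcher creates the pipe and hands the write end to the server it starts, only that server can answer on it, which is the process tree relationship mentioned above.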
On Wed, 2011-01-05 at 11:12 -0500, Adam Jackson wrote:
The deeper problem is that clients authenticate themselves to the server, but then simply trust that the server is the server they were hoping for. If you don't have a process tree relationship (like the gdm +displayfd case) then you have to go all the way to something like Kerberos for that kind of bidirectional auth.
Not quite: you can use the xauth cookie as a pre-shared key.
Simply moving back to filesystem sockets does not solve that -
Right; what solves it is putting the socket in a place that is writable only by the user running the server.
and indeed, has _more_ DoS conditions than abstract sockets since they don't get garbage-collected on system crash
They do if you use a tmpfs (e.g., /var/run with systemd), but in any event it's easy enough to unlink the socket or try another name. The more significant DoS condition is another user taking the name you want, which can happen in the abstract namespace but not in a directory only you can write.
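A minimal sketch of "a directory only you can write", assuming Python's `tempfile.mkdtemp` (which creates the directory with mode 0700) stands in for whatever location the service would really use. The helper names `bind_in_private_dir` and `only_owner_can_write` are made up for illustration.

```python
import os
import socket
import stat
import tempfile


def bind_in_private_dir():
    """Create a 0700 directory and bind a Unix socket inside it."""
    directory = tempfile.mkdtemp(prefix="svc-")  # created with mode 0700
    path = os.path.join(directory, "socket")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock, directory, path


def only_owner_can_write(directory):
    """True if the directory is ours and no group/other bits are set."""
    info = os.lstat(directory)
    return (stat.S_ISDIR(info.st_mode)
            and info.st_uid == os.geteuid()
            and (stat.S_IMODE(info.st_mode) & 0o077) == 0)
```

A client that knows the directory can run the same ownership check before connecting, which is something the abstract namespace has no equivalent for.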
On Wed, 2011-01-05 at 13:38 -0500, Matt McCutchen wrote:
On Wed, 2011-01-05 at 11:12 -0500, Adam Jackson wrote:
The deeper problem is that clients authenticate themselves to the server, but then simply trust that the server is the server they were hoping for. If you don't have a process tree relationship (like the gdm +displayfd case) then you have to go all the way to something like Kerberos for that kind of bidirectional auth.
Not quite: you can use the xauth cookie as a pre-shared key.
That doesn't work. If you're trying to spoof the X server then you write an X server that simply accepts whatever auth cookie you give it. There's no way, once you've connected to X, to ask what cookies it accepts. (Because different auth cookies can have different security levels, so you don't want to disclose to a less-trusted level the cookie of a more-trusted level.) The only way you can know what cookies it accepts is if you started it yourself; and if you did that, you have a process tree relationship.
and indeed, has _more_ DoS conditions than abstract sockets since they don't get garbage-collected on system crash
They do if you use a tmpfs (e.g., /var/run with systemd), but in any event it's easy enough to unlink the socket or try another name.
Your attacker wants to spoof a service. They've created a socket on the name you want, and now you want to unlink it and make your own. Why do you think they can't notice the unlink and recreate the socket? Thus making it a race you might win sometimes, depending on how the scheduler is feeling on any given day.
The more significant DoS condition is another user taking the name you want, which can happen in the abstract namespace but not in a directory only you can write.
I don't have any of those. If the X server is running as root (like in the gdm case) then I can put the socket wherever I want. If it's Xvfb, then where do I put this directory? $HOME ? Nope, might not be there. /tmp/$USER ? Won't work if someone else mkdir'd /tmp/ajax before I did.
- ajax
On Wed, 2011-01-05 at 15:25 -0500, Adam Jackson wrote:
On Wed, 2011-01-05 at 13:38 -0500, Matt McCutchen wrote:
The more significant DoS condition is another user taking the name you want, which can happen in the abstract namespace but not in a directory only you can write.
I don't have any of those. If the X server is running as root (like in the gdm case) then I can put the socket wherever I want. If it's Xvfb, then where do I put this directory? $HOME ? Nope, might not be there. /tmp/$USER ? Won't work if someone else mkdir'd /tmp/ajax before I did.
What about the XDG_RUNTIME_DIR (/var/run/user/$USER) from systemd?
On 01/05/2011 04:33 PM, Matt McCutchen wrote:
On Wed, 2011-01-05 at 15:25 -0500, Adam Jackson wrote:
On Wed, 2011-01-05 at 13:38 -0500, Matt McCutchen wrote:
The more significant DoS condition is another user taking the name you want, which can happen in the abstract namespace but not in a directory only you can write.
I don't have any of those. If the X server is running as root (like in the gdm case) then I can put the socket wherever I want. If it's Xvfb, then where do I put this directory? $HOME ? Nope, might not be there. /tmp/$USER ? Won't work if someone else mkdir'd /tmp/ajax before I did.
What about the XDG_RUNTIME_DIR (/var/run/user/$USER) from systemd?
This does not exist until after the user has logged in. X starts before the user logs in. Also, multiple users need to be able to talk to the same X server. Not sure about switchuser.
On Wed, 2011-01-05 at 16:37 -0500, Daniel J Walsh wrote:
[XDG_RUNTIME_DIR] does not exist until after the User has logged in. X starts before the user logs in. Also multiple users need to be able to talk to same xserver.
On Wed, 2011-01-05 at 16:47 -0500, Adam Jackson wrote:
atropine:~% ssh 10.16.61.101
test@10.16.61.101's password:
Last login: Wed Jan 5 16:42:43 2011
[test@dhcp-10-16-61-101 ~]$ set | grep XDG
[test@dhcp-10-16-61-101 ~]$ rpm -q systemd fedora-release
systemd-15-1.fc15.x86_64
fedora-release-15-0.3.noarch
Console login at least gives me an XDG_SESSION_COOKIE.
Yes, I guess XDG_RUNTIME_DIR won't work in its current form, but it should be easy enough for systemd to provide directories with the necessary permissions at the necessary times. I think this is the right solution.
On Wed, 2011-01-05 at 16:33 -0500, Matt McCutchen wrote:
On Wed, 2011-01-05 at 15:25 -0500, Adam Jackson wrote:
I don't have any of those. If the X server is running as root (like in the gdm case) then I can put the socket wherever I want. If it's Xvfb, then where do I put this directory? $HOME ? Nope, might not be there. /tmp/$USER ? Won't work if someone else mkdir'd /tmp/ajax before I did.
What about the XDG_RUNTIME_DIR (/var/run/user/$USER) from systemd?
atropine:~% ssh 10.16.61.101
test@10.16.61.101's password:
Last login: Wed Jan 5 16:42:43 2011
[test@dhcp-10-16-61-101 ~]$ set | grep XDG
[test@dhcp-10-16-61-101 ~]$ rpm -q systemd fedora-release
systemd-15-1.fc15.x86_64
fedora-release-15-0.3.noarch
Console login at least gives me an XDG_SESSION_COOKIE.
- ajax
On Wed, 05.01.11 16:47, Adam Jackson (ajax@redhat.com) wrote:
On Wed, 2011-01-05 at 16:33 -0500, Matt McCutchen wrote:
On Wed, 2011-01-05 at 15:25 -0500, Adam Jackson wrote:
I don't have any of those. If the X server is running as root (like in the gdm case) then I can put the socket wherever I want. If it's Xvfb, then where do I put this directory? $HOME ? Nope, might not be there. /tmp/$USER ? Won't work if someone else mkdir'd /tmp/ajax before I did.
What about the XDG_RUNTIME_DIR (/var/run/user/$USER) from systemd?
atropine:~% ssh 10.16.61.101
test@10.16.61.101's password:
Last login: Wed Jan 5 16:42:43 2011
[test@dhcp-10-16-61-101 ~]$ set | grep XDG
[test@dhcp-10-16-61-101 ~]$ rpm -q systemd fedora-release
systemd-15-1.fc15.x86_64
fedora-release-15-0.3.noarch
Console login at least gives me an XDG_SESSION_COOKIE.
That should work. Probably during upgrade the PAM files weren't corrected. Try invoking "authconfig".
XDG_SESSION_COOKIE is supposed to be secret and is probably going to go away soonishly, as it is obsolete now that we have /proc/self/loginuid.
Lennart
An aside:
On Wed, 2011-01-05 at 11:12 -0500, Adam Jackson wrote:
(And of course what we're doing here is protecting against a malicious attacker who already has enough privileges to run code on your system, which means you're pretty far into having already lost. Meh.)
I've seen this viewpoint a number of places. IMO, it's a shame that the community seems to be giving up on local system security. In various situations, it would be quite convenient if I could give other people shell accounts on my machine without risking compromise of all of my data. The virtualization solutions are more work to set up. If what you say is right, the many schools that still use large shared shell servers are relying on their users not to be too evil, or alternatively the users shouldn't use the servers for anything important.
On Wed, 2011-01-05 at 14:10 -0500, Matt McCutchen wrote:
On Wed, 2011-01-05 at 11:12 -0500, Adam Jackson wrote:
(And of course what we're doing here is protecting against a malicious attacker who already has enough privileges to run code on your system, which means you're pretty far into having already lost. Meh.)
I've seen this viewpoint a number of places. IMO, it's a shame that the community seems to be giving up on local system security. In various situations, it would be quite convenient if I could give other people shell accounts on my machine without risking compromise of all of my data. The virtualization solutions are more work to set up.
You're putting words in my mouth just a little.
The existing discussion was about denial of service attacks. The case I was making is that adequate defense against DoS requires programming techniques more subtle than simple prohibition of abstract sockets, and (more broadly) a system that assures that resources are fairly allocated, for arbitrarily complex definitions of "fair". If you have a malicious user who can run code on your machine, you've granted him CPU time. You have already lost. You're deciding how much to lose.
The position you're painting me in is in opposition to:
"[...] risking compromise of all my data [...]"
and at no point was I arguing that access control or integrity were unimportant. If they weren't, we wouldn't bother with xauth at all. And they are concepts that are entirely achievable even within the unix model. You're still relying on the absence of bugs, but okay, that's always the gamble we make.
But prevention of DoS on the part of local actors is just not a game you can win. If nothing else, remember that the way Linux implements malloc() assumes you have infinite memory, which means you overcommit resources, which means failure happens. You can write code that prevents many DoS conditions and that's almost always worthwhile, but at the end of the day it's a system with overcommit and therefore you either need trust in your participants or policy to rein them in.
DoS simply is not a security issue. There are many other adjectives you can apply to it - availability, reliability, quality, usability; desirable qualities all - but security is not one of them.
If what you say is right, the many schools that still use large shared shell servers are relying on their users not to be too evil, or alternatively the users shouldn't use the servers for anything important.
That's been true since at least the RTM worm.
- ajax
On Wed, Jan 5, 2011 at 4:13 PM, Adam Jackson ajax@redhat.com wrote:
But prevention of DoS on the part of local actors is just not a game you can win. If nothing else, remember that the way Linux implements malloc() assumes you have infinite memory, which means you overcommit resources, which means failure happens. You can write code that
[snip]
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 0 > /proc/sys/vm/overcommit_ratio
:)
(and good luck with that!)
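For readers who have not met these two knobs: mode 2 switches the kernel to strict commit accounting, and the commit limit is then roughly swap plus `overcommit_ratio` percent of RAM. The sketch below decodes the mode and computes that approximate limit. The function names are hypothetical, and the formula ignores huge pages and other adjustments the kernel makes.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting (no overcommit beyond the commit limit)",
}


def describe_overcommit(raw):
    """Translate the contents of vm.overcommit_memory into words."""
    return OVERCOMMIT_MODES.get(int(raw.strip()), "unknown mode")


def commit_limit_kib(ram_kib, swap_kib, ratio_percent):
    """Approximate commit limit in mode 2 (huge pages ignored)."""
    return swap_kib + ram_kib * ratio_percent // 100


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Describe the running system's mode, or None if unreadable."""
    try:
        with open(path) as handle:
            return describe_overcommit(handle.read())
    except (OSError, ValueError):
        return None
```

With a ratio of 0, `commit_limit_kib` returns just the swap size, which is why the settings above come with a "good luck".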
On 01/05/2011 04:38 PM, Gregory Maxwell wrote:
On Wed, Jan 5, 2011 at 4:13 PM, Adam Jackson ajax@redhat.com wrote:
But prevention of DoS on the part of local actors is just not a game you can win. If nothing else, remember that the way Linux implements malloc() assumes you have infinite memory, which means you overcommit resources, which means failure happens. You can write code that
[snip]
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 0 > /proc/sys/vm/overcommit_ratio
:)
(and good luck with that!)
BTW SELinux confined users and cgroups can help somewhat to control those nasty students, but stopping a DoS will still be difficult.
On Wed, 2011-01-05 at 16:13 -0500, Adam Jackson wrote:
On Wed, 2011-01-05 at 14:10 -0500, Matt McCutchen wrote:
On Wed, 2011-01-05 at 11:12 -0500, Adam Jackson wrote:
(And of course what we're doing here is protecting against a malicious attacker who already has enough privileges to run code on your system, which means you're pretty far into having already lost. Meh.)
I've seen this viewpoint a number of places. IMO, it's a shame that the community seems to be giving up on local system security. In various situations, it would be quite convenient if I could give other people shell accounts on my machine without risking compromise of all of my data. The virtualization solutions are more work to set up.
You're putting words in my mouth just a little.
The existing discussion was about denial of service attacks.
OK, I misunderstood. Reading your remark by itself, I thought it referred to confidentiality and integrity too.
On Wed, 05 Jan 2011 16:13:25 -0500 Adam Jackson ajax@redhat.com wrote:
But prevention of DoS on the part of local actors is just not a game you can win. If nothing else, remember that the way Linux implements malloc() assumes you have infinite memory, which means you overcommit resources, which means failure happens.
As long as we say things like the first one, Oracle will continue to pretend that Solaris is somehow more suitable to deploy Sunray... As for the second one, look here (we ship with overcommit set to heuristic, which is why Webkit crashes in Rawhide): https://bugzilla.redhat.com/show_bug.cgi?id=648319#c63
-- Pete
On Mon, 20.12.10 19:16, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
(b) FIFOs for RT wakeups (this could use semaphores now)
If this uses O_NOATIME it shouldn't matter whether the backing fs is tmpfs or real disk.
(c) shared memory created via either the sysv or posix shm API
As mentioned by other people too: sysv shm is not placed in /dev/shm. It lives in an independent non-fs namespace.
we don't care about the unix domain sockets' performance characteristics, but it's convenient to have them in a known location that happens to be close to where (b) is located.
we do care about the performance of (b)
If O_NOATIME is not the answer to your questions, then you could even pass the fifo fd via the unix socket and have it completely independent of any real fs.
Lennart
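Passing the wakeup descriptor over the Unix socket, so that it never needs a name on any filesystem, might look like the following sketch. It uses an anonymous pipe in place of a named FIFO and Python's `socket.send_fds`/`socket.recv_fds` (available since Python 3.9), which wrap `sendmsg`/`recvmsg` with `SCM_RIGHTS`. The function name `pass_fifo_fd` is hypothetical and both ends live in one process purely for illustration.

```python
import os
import socket


def pass_fifo_fd():
    """Send the read end of a pipe over a Unix socket pair (SCM_RIGHTS)."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # At least one byte of ordinary data must accompany the descriptor.
        socket.send_fds(sender, [b"fifo"], [read_end])
        _msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
        received = fds[0]  # a new descriptor referring to the same pipe
        os.write(write_end, b"wakeup")
        data = os.read(received, 16)
        os.close(received)
        return data
    finally:
        os.close(read_end)
        os.close(write_end)
        sender.close()
        receiver.close()
```

The receiver reads the wakeup through the descriptor it was handed, so no inode on disk or tmpfs is involved in the data path.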
On Thu, 2010-12-23 at 22:59 +0100, Lennart Poettering wrote:
On Mon, 20.12.10 19:16, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
this isn't exactly correct.
in /dev/shm on linux we have:
(a) unix-domain sockets for non-RT communication with the server
(b) FIFOs for RT wakeups (this could use semaphores now)
If this uses O_NOATIME it shouldn't matter whether the backing fs is tmpfs or real disk.
Sadly this turns out not to be the case, at least if I'm reading fs/pipe.c correctly. O_NOATIME will turn off atime updates, but mtime and ctime are still modified on every pipe write, and there's no such thing as O_NOCMTIME even though the filesystem layer does have the concept internally. Which means device-backed filesystems will see write traffic just for using named pipes.
Heck of lame. Someone should fix that.
- ajax
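One way to check this claim on a given kernel is to write to a FIFO opened with O_NOATIME and compare its mtime before and after. The sketch below does only that; it makes no claim about what the result will be on any particular system. The name `fifo_mtime_after_write` is hypothetical, and `O_NOATIME` is looked up defensively because it is Linux-only.

```python
import os
import tempfile
import time


def fifo_mtime_after_write():
    """Return (mtime_before, mtime_after) in ns around one FIFO write."""
    noatime = getattr(os, "O_NOATIME", 0)  # Linux-only flag
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "wakeup.fifo")
    os.mkfifo(path, 0o600)
    # Open the reader first so the non-blocking writer open succeeds.
    reader = os.open(path, os.O_RDONLY | os.O_NONBLOCK | noatime)
    writer = os.open(path, os.O_WRONLY | os.O_NONBLOCK | noatime)
    try:
        before = os.stat(path).st_mtime_ns
        time.sleep(0.05)  # let the clock move past timestamp granularity
        os.write(writer, b"x")
        after = os.stat(path).st_mtime_ns
        return before, after
    finally:
        os.close(reader)
        os.close(writer)
        os.unlink(path)
        os.rmdir(directory)
```

If `after` is larger than `before`, the write touched the inode's mtime despite O_NOATIME, which is the behaviour described above.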
Once upon a time, Adam Jackson ajax@redhat.com said:
Sadly this turns out not to be the case, at least if I'm reading fs/pipe.c correctly. O_NOATIME will turn off atime updates, but mtime and ctime are still modified on every pipe write, and there's no such thing as O_NOCMTIME even though the filesystem layer does have the concept internally. Which means device-backed filesystems will see write traffic just for using named pipes.
Heck of lame. Someone should fix that.
The behavior follows the standard, so it shouldn't just be changed by default without checking if anybody uses the standard behavior.
On Mon, 03.01.11 09:54, Chris Adams (cmadams@hiwaay.net) wrote:
Once upon a time, Adam Jackson ajax@redhat.com said:
Sadly this turns out not to be the case, at least if I'm reading fs/pipe.c correctly. O_NOATIME will turn off atime updates, but mtime and ctime are still modified on every pipe write, and there's no such thing as O_NOCMTIME even though the filesystem layer does have the concept internally. Which means device-backed filesystems will see write traffic just for using named pipes.
Heck of lame. Someone should fix that.
The behavior follows the standard, so it shouldn't just be changed by default without checking if anybody uses the standard behavior.
Well, I think introducing O_NOCTIME the same way O_NOATIME was introduced would be unproblematic: only if it is set the normal ctime behaviour would be disabled.
But yeah, I agree with ajax, the fact that the ctime of a fifo is updated all the time and there is no way around it is kinda ridiculous... And it gives the jack folks a really good reason not to stick a fifo into /tmp.
Lennart
On Tue, Dec 21, 2010 at 2:26 AM, Fernando Lopez-Lezcano nando@ccrma.stanford.edu wrote:
On 12/20/2010 02:17 PM, Adam Jackson wrote:
On Mon, 2010-12-20 at 13:07 -0800, Fernando Lopez-Lezcano wrote:
I would like to bring to the attention of the list another current usage of the tmpfs mounted on /dev/shm in Fedora packages:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
This is right and wrong.
Right! Thanks very much for looking at this in such detail (I presume you looked at the 1.9.6 code base?).
JACK uses /dev/shm for two purposes on Linux [1]. The first is as the definition of what its configure script calls HOST_DEFAULT_TMP_DIR. This path is only used as a name to which to attach the jack sockets. The extent to which this will _ever_ touch the disk, even on a journaled filesystem, is:
- eventually, the inode for that socket and the dnode for the containing directory will have to be written to the disk, once.
- under memory pressure the vfs may decide to throw away the inode cache for that socket, which would then have to be re-read from disk for subsequent connecting JACK clients.
In other words, these are setup costs, not maintenance costs. This may cause glitches in a realtime scenario to the extent that clients are created and destroyed, but in general I submit that the cost of exec() of those new clients is going to dwarf the cost of the inode cache miss for the JACK socket. [2]
My experience (caveat: a long time ago, maybe everything has changed internally in both jack and the kernel and that has invalidated my experience cache :-) was that using /tmp would lead to constant - not all the time, but very frequent and not correlated with client connection/disconnection - xruns (glitches in the audio), using /dev/shm would fix that immediately. That was why things were moved over to /dev/shm if I remember correctly.
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
On Thu, 2010-12-23 at 18:26 +0100, drago01 wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Is there any specific reason to consider applications that store large files in /tmp broken?
In fact, historically the purpose of /tmp is _exactly_ the opposite. For example, sort(1) can be used to sort very large files. Small inputs are kept in memory; large inputs use temporary files in /tmp. The _whole point_ of using /tmp in this case is that it can hold more data than the virtual memory subsystem (or, perhaps, the address space) can handle. If /tmp becomes tmpfs, this useful property of /tmp disappears.
Mirek
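The spill-to-temporary-files technique referred to here is an external merge sort. The sketch below is a simplified illustration, not how sort(1) is actually implemented; `external_sort`, `chunk_size` and `tmp_dir` are hypothetical names. Sorted runs are written to temporary files (in `$TMPDIR` or /tmp by default) and then merged as streams.

```python
import heapq
import tempfile


def external_sort(lines, chunk_size=1000, tmp_dir=None):
    """Sort an iterable of lines, spilling sorted runs to temporary files."""
    runs = []
    chunk = []

    def spill():
        # Write one sorted run to disk and rewind it for the merge phase.
        run = tempfile.TemporaryFile(mode="w+", dir=tmp_dir)
        run.writelines(sorted(chunk))
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for line in lines:
        chunk.append(line if line.endswith("\n") else line + "\n")
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()
    try:
        # heapq.merge streams the runs, holding one line per run in memory.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()
```

Only `chunk_size` lines are held in memory at a time, so the total input is bounded by the space behind the temporary directory rather than by RAM. If that directory is a tmpfs, the two bounds collapse into one.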
On Thu, Dec 23, 2010 at 11:26 AM, drago01 drago01@gmail.com wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Unfortunately firefox is one of those apps. I experimented with tmpfs /tmp a while back, and ran into very much badness. /tmp rapidly gets all full of large PDFs I've clicked on, and the flash plugin also seems to like to spool video it's streaming in /tmp.
It also likes to not properly clean up after itself. Even without a tmpfs /tmp I've run into fun problems of PDFs and youtube filling up my root and resulting in badness, requiring manual cleanup of /tmp, if I want PDFs and youtube to continue.
This kind of crap belongs in ~/.tmp/ or something. Then it can fill up /home as you would expect users to do and leave root out of it. :P
In fact on my servers I symlink /tmp to /home/tmp, as I like to keep root as small as possible and maximize /home. And no, a dedicated /tmp filesystem is silly: why would I want to dedicate a fixed slice of disk space to /tmp that isn't going to be used 99% of the time, and will inevitably turn out to be not big enough 1% of the time?
On Wed, Jan 19, 2011 at 01:11:08PM -0600, Callum Lerwick wrote:
On Thu, Dec 23, 2010 at 11:26 AM, drago01 drago01@gmail.com wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Unfortunately firefox is one of those apps. I experimented with tmpfs /tmp a while back, and ran into very much badness. /tmp rapidly gets all full of large PDFs I've clicked on, and the flash plugin also seems to like to spool video it's streaming in /tmp.
In fact on my servers I symlink /tmp to /home/tmp, as I like to keep root as small as possible and maximize /home. And no, a dedicated /tmp filesystem is silly: why would I want to dedicate a fixed slice of disk space to /tmp that isn't going to be used 99% of the time, and will inevitably turn out to be not big enough 1% of the time?
You can add a cherry on top of your /home/tmp solution using per-user /tmp: http://fedoraproject.org/wiki/Infrastructure/FedoraPeopleConfig#polyinstanti...
Which is a very cool solution, although orthogonal to the problem described ;)
On 01/19/2011 12:11 PM, Callum Lerwick wrote:
On Thu, Dec 23, 2010 at 11:26 AM, drago01 drago01@gmail.com wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Unfortunately firefox is one of those apps. I experimented with tmpfs /tmp a while back, and ran into very much badness. /tmp rapidly gets all full of large PDFs I've clicked on, and the flash plugin also seems to like to spool video it's streaming in /tmp.
Playing around with flash spooling, I noticed that Chrome uses ~/.cache/google-chrome... I wonder if firefox and friends should use places like that instead?
On Thu, 2011-01-20 at 00:33 -0700, Nathanael D. Noblet wrote:
On 01/19/2011 12:11 PM, Callum Lerwick wrote:
On Thu, Dec 23, 2010 at 11:26 AM, drago01 drago01@gmail.com wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Unfortunately firefox is one of those apps. I experimented with tmpfs /tmp a while back, and ran into very much badness. /tmp rapidly gets all full of large PDFs I've clicked on, and the flash plugin also seems to like to spool video it's streaming in /tmp.
Playing around with flash spooling, I noticed that Chrome uses ~/.cache/google-chrome... I wonder if firefox and friends should use places like that instead?
If /tmp is not supposed to be used for data that is inconvenient to store in memory for whatever reason, and that should be automatically removed when it is not used, what _is_ it supposed to be used for?
Mirek
On Thu, Jan 20, 2011 at 08:37:21AM +0100, Miloslav Trmač wrote:
On Thu, 2011-01-20 at 00:33 -0700, Nathanael D. Noblet wrote:
On 01/19/2011 12:11 PM, Callum Lerwick wrote:
On Thu, Dec 23, 2010 at 11:26 AM, drago01 drago01@gmail.com wrote:
Well /tmp should be mounted tmpfs anyway (I have been doing this for years and it is working just fine). tmp isn't a persistent storage so it makes a lot of sense, and it is *not* a dumping ground for giant files (apps that try to do that are just broken).
Unfortunately firefox is one of those apps. I experimented with tmpfs /tmp a while back, and ran into very much badness. /tmp rapidly gets all full of large PDFs I've clicked on, and the flash plugin also seems to like to spool video it's streaming in /tmp.
Playing around with flash spooling, I noticed that Chrome uses ~/.cache/google-chrome... I wonder if firefox and friends should use places like that instead?
If /tmp is not supposed to be used for data that is inconvenient to store in memory for whatever reason, and that should be automatically removed when it is not used, what _is_ it supposed to be used for?
The FHS has some scattered guidance:
(1) http://www.pathname.com/fhs/pub/fhs-2.3.html#THEROOTFILESYSTEM
(2) http://www.pathname.com/fhs/pub/fhs-2.3.html#VARTMPTEMPORARYFILESPRESERVEDBE...
(3) http://www.pathname.com/fhs/pub/fhs-2.3.html#TMPTEMPORARYFILES
I read from this: that (1) the root filesystem should be considered a limited resource (as it is on some embedded systems, not necessarily on Fedora) and so you shouldn't store excessively large files there. "Root filesystem" would include /tmp in many but not all cases.
That (3) also says that /tmp can be cleaned up at each reboot. It isn't on Fedora, but it is on Debian for example. On Fedora /tmp is cleaned after 10 days.
That (2) says /var/tmp is suitable for files that need to persist across reboots. And because of (1) is also suitable for large files. On Fedora /var/tmp is cleaned after 30 days.
If what you're storing isn't a temporary file (whatever that means) then there are better places to put them: eg. the home directory, /var/cache, /var/spool etc.
After reading this I made some changes to libguestfs so it behaves more according to these rules.
Rich.
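As a rough illustration of the reading above (small, reboot-disposable files in /tmp; large or reboot-surviving ones in /var/tmp), a program could choose its scratch directory like this. The function name and the 64 MiB threshold are invented for the example; the FHS does not define any size cutoff.

```python
def choose_tmp_dir(expected_bytes, must_survive_reboot=False,
                   large_threshold=64 * 1024 * 1024):
    """Pick /tmp or /var/tmp following the FHS reading above.

    large_threshold is a hypothetical cutoff, not something the FHS states.
    """
    if must_survive_reboot or expected_bytes >= large_threshold:
        return "/var/tmp"
    return "/tmp"
```

The result can be passed as the `dir=` argument of `tempfile.NamedTemporaryFile` or `tempfile.mkdtemp`, so the rest of the program does not need to know which location was chosen.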
On Fri, 21.01.11 15:01, Richard W.M. Jones (rjones@redhat.com) wrote:
If /tmp is not supposed to be used for data that is inconvenient to store in memory for whatever reason, and that should be automatically removed when it is not used, what _is_ it supposed to be used for?
The FHS has some scattered guidance:
(1) http://www.pathname.com/fhs/pub/fhs-2.3.html#THEROOTFILESYSTEM
(2) http://www.pathname.com/fhs/pub/fhs-2.3.html#VARTMPTEMPORARYFILESPRESERVEDBE...
(3) http://www.pathname.com/fhs/pub/fhs-2.3.html#TMPTEMPORARYFILES
The FHS is kinda old these days, and it has been a while since it was last updated. The LSB added some additional rules on top of it:
http://refspecs.linux-foundation.org/LSB_4.0.0/LSB-Core-generic/LSB-Core-gen...
As did the XDG base dir spec:
http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html
Especially the latter introduced a few things that might be useful in this context.
Lennart
On Fri, Jan 21, 2011 at 05:54:22PM +0100, Lennart Poettering wrote:
The FHS is kinda old these days, and it has been a while since it was last updated. The LSB added some additional rules on top of it:
As long as we keep in mind that we don't follow the LSB at points where it is ridiculous.
On 01/22/2011 06:22 AM, Matthew Miller wrote:
On Fri, Jan 21, 2011 at 05:54:22PM +0100, Lennart Poettering wrote:
The FHS is kinda old these days, and it has been a while since it was last updated. The LSB added some additional rules on top of it:
As long as we keep in mind that we don't follow the LSB at points where it is ridiculous.
Give three examples, please.
--
On Mon, 20.12.10 17:26, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
In other words, these are setup costs, not maintenance costs. This may cause glitches in a realtime scenario to the extent that clients are created and destroyed, but in general I submit that the cost of exec() of those new clients is going to dwarf the cost of the inode cache miss for the JACK socket. [2]
My experience (caveat: a long time ago, maybe everything has changed internally in both jack and the kernel and that has invalidated my experience cache :-) was that using /tmp would lead to constant - not all the time, but very frequent and not correlated with client connection/disconnection - xruns (glitches in the audio), using /dev/shm would fix that immediately. That was why things were moved over to /dev/shm if I remember correctly.
Smells like something related to atime updating.
Lennart
On Mon, 20.12.10 13:07, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
Jack (the Jack Audio Connection Kit, jackaudio.org) has been using the file api (apologies if my wording is not absolutely correct in unix terms) on the tmpfs filesystem that is mounted on /dev/shm for a very long time (10 years?). "/tmp" is not useful to Jack because Jack's internal communication pipes can't be stored in any disk based journaled filesystem as the latencies involved in accessing them cause glitches in the audio streams handled by Jack.
to be frank I don't really buy this. A FIFO or socket in /tmp should be fine as long as it is opened with O_NOATIME. The data in the fifo buffers or the socket buffers never ever touches the disk and hence it is irrelevant whether it is tmpfs or a real disk.
I raise this issue because "The API for /dev/shm is shm_open()" statement above means to me that in the future there will be no file api access to a ram mounted filesystem in Fedora (I understand that this is my own conclusion, but I can't see any other given the wording of the statement above). Before someone implements that idea, please consider the needs of a filesystem in ram for such uses as those mentioned in this thread (and that is supported by the Fedora distribution by default). Just in case...
This too appears to be a good usecase for XDG_RUNTIME_DIR btw.
Lennart
On 12/23/2010 01:52 PM, Lennart Poettering wrote:
On Mon, 20.12.10 13:07, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
I raise this issue because "The API for /dev/shm is shm_open()" statement above means to me that in the future there will be no file api access to a ram mounted filesystem in Fedora (I understand that this is my own conclusion, but I can't see any other given the wording of the statement above). Before someone implements that idea, please consider the needs of a filesystem in ram for such uses as those mentioned in this thread (and that is supported by the Fedora distribution by default). Just in case...
This too appears to be a good usecase for XDG_RUNTIME_DIR btw.
If I understand correctly this would be available for logged-in users only. If /var/run is going to be a tmpfs in fc15+ (if I understand correctly another message you posted in this thread) then that would appear to be a better option to my eyes (the main Jack developers might have other opinions/ideas, I'll try to keep them posted).
-- Fernando
On Fri, 24.12.10 16:17, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
On 12/23/2010 01:52 PM, Lennart Poettering wrote:
On Mon, 20.12.10 13:07, Fernando Lopez-Lezcano (nando@ccrma.Stanford.EDU) wrote:
I raise this issue because "The API for /dev/shm is shm_open()" statement above means to me that in the future there will be no file api access to a ram mounted filesystem in Fedora (I understand that this is my own conclusion, but I can't see any other given the wording of the statement above). Before someone implements that idea, please consider the needs of a filesystem in ram for such uses as those mentioned in this thread (and that is supported by the Fedora distribution by default). Just in case...
This too appears to be a good usecase for XDG_RUNTIME_DIR btw.
If I understand correctly this would be available for logged-in users only. If /var/run is going to be a tmpfs in fc15+ (if I understand correctly another message you posted in this thread) then that would appear to be a better option to my eyes (the main Jack developers might have other opinions/ideas, I'll try to keep them posted).
For the precise semantics of XDG_RUNTIME_DIR please refer to the XDG basedir spec:
http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html
Most distributions and many programs implement this spec in one way or another, however the XDG_RUNTIME_DIR part is a relatively new addition, and F15 is probably the first bigger distro which implements it.
Lennart
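For programs that want to try XDG_RUNTIME_DIR and fall back when it is missing (as in the ssh session earlier in the thread), a defensive lookup might look like the sketch below. The basedir spec requires the directory to be owned by the user with access mode 0700; the function name `runtime_dir` and the choice to return None rather than pick a fallback are assumptions of this example.

```python
import os
import stat


def runtime_dir(environ=None):
    """Return a usable XDG_RUNTIME_DIR, or None if unset or unsafe."""
    environ = os.environ if environ is None else environ
    path = environ.get("XDG_RUNTIME_DIR")
    if not path or not os.path.isabs(path):
        return None
    try:
        info = os.lstat(path)  # lstat: do not follow a symlink planted here
    except OSError:
        return None
    owned = info.st_uid == os.geteuid()
    private = stat.S_IMODE(info.st_mode) == 0o700
    if stat.S_ISDIR(info.st_mode) and owned and private:
        return path
    return None
```

A caller such as a JACK-like server could place its sockets and FIFOs under the returned path and otherwise fall back to whatever location it used before.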