On 01/29/2018 12:42 PM, Steven Whitehouse wrote:
-------- Forwarded Message --------
Subject: Re: Fedora27: NFS v4 terrible write performance, is async working
Date: Sun, 28 Jan 2018 21:17:02 +0000
From: Terry Barnaby <terry1@beam.ltd.uk>
To: Steven Whitehouse <swhiteho@redhat.com>, Development discussions related to Fedora <devel@lists.fedoraproject.org>, Terry Barnaby <terry1@beam.ltd.uk>
CC: Steve Dickson <steved@redhat.com>, Benjamin Coddington <bcodding@redhat.com>
On 28/01/18 14:38, Steven Whitehouse wrote:
Hi,
On 28/01/18 07:48, Terry Barnaby wrote:
When doing a tar -xzf ... of a big source tar on an NFSv4 file system the time taken is huge. I am seeing an overall data rate of about 1 MByte per second across the network interface. If I copy a single large file I see a network data rate of about 110 MBytes/sec which is about the limit of the Gigabit Ethernet interface I am using.
Now, in the past I have used the NFS "async" mount option to help with write speed (lots of small files in the case of an untar of a set of source files).
However, this does not seem to speed this up in Fedora27 and also I don't see the "async" option listed when I run the "mount" command. When I use the "sync" option it does show up in the "mount" list.
The question is, is the "async" option actually working with NFS v4 in Fedora27 ?
No. It's something left over from v3 that allowed servers to be unsafe. With v4, the protocol defines the stability of writes.
What server is in use? Is that Linux too? Also, is this v4.0 or v4.1? I've copied in some of the NFS team who should be able to assist,
Steve.
Thanks for the reply.
Server is a Fedora27 as well. vers=4.2 the default. Same issue at other sites with Fedora27.
Server export: "/data *.kingnet(rw,async,fsid=17)"
Client fstab: "king.kingnet:/data /data nfs async,nocto 0 0"
Client mount: "king.kingnet:/data on /data type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,nocto,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.202.2,local_lock=none,addr=192.168.202.1)"
This looks normal except for setting fsid=17...
The best way to debug this is to open up a bugzilla report and attach a (compressed) wireshark network trace to see what is happening on the wire... The entire tar is not needed, just a good chunk...
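A capture along these lines should be enough (this assumes the standard NFS port 2049; the interface name eth0 and file names are only placeholders, adjust for your setup):

$ tcpdump -i eth0 -s 0 -w nfs-untar.pcap port 2049   # capture NFS traffic on the client while the untar runs
$ gzip nfs-untar.pcap                                # compress before attaching to the bug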
steved.
On 29/01/18 19:50, Steve Dickson wrote:
On 01/29/2018 12:42 PM, Steven Whitehouse wrote:
-------- Forwarded Message --------
Subject: Re: Fedora27: NFS v4 terrible write performance, is async working
Date: Sun, 28 Jan 2018 21:17:02 +0000
From: Terry Barnaby <terry1@beam.ltd.uk>
To: Steven Whitehouse <swhiteho@redhat.com>, Development discussions related to Fedora <devel@lists.fedoraproject.org>, Terry Barnaby <terry1@beam.ltd.uk>
CC: Steve Dickson <steved@redhat.com>, Benjamin Coddington <bcodding@redhat.com>
On 28/01/18 14:38, Steven Whitehouse wrote:
Hi,
On 28/01/18 07:48, Terry Barnaby wrote:
When doing a tar -xzf ... of a big source tar on an NFSv4 file system the time taken is huge. I am seeing an overall data rate of about 1 MByte per second across the network interface. If I copy a single large file I see a network data rate of about 110 MBytes/sec which is about the limit of the Gigabit Ethernet interface I am using.
Now, in the past I have used the NFS "async" mount option to help with write speed (lots of small files in the case of an untar of a set of source files).
However, this does not seem to speed this up in Fedora27 and also I don't see the "async" option listed when I run the "mount" command. When I use the "sync" option it does show up in the "mount" list.
The question is, is the "async" option actually working with NFS v4 in Fedora27 ?
No. It's something left over from v3 that allowed servers to be unsafe. With v4, the protocol defines the stability of writes.
Thanks for the reply.
Ok, that's a shame, unless NFSv4's write performance with small files/dirs is relatively ok, which it isn't on my systems. Although async was "unsafe", this was not an issue in most standard scenarios, such as an NFS-mounted home directory only being used by one client. The async option also does not appear to work when using NFSv3. I guess it was removed from that protocol at some point as well ?
What server is in use? Is that Linux too? Also, is this v4.0 or v4.1? I've copied in some of the NFS team who should be able to assist,
Steve.
Thanks for the reply.
Server is a Fedora27 as well. vers=4.2 the default. Same issue at other sites with Fedora27.
Server export: "/data *.kingnet(rw,async,fsid=17)"
Client fstab: "king.kingnet:/data /data nfs async,nocto 0 0"
Client mount: "king.kingnet:/data on /data type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,nocto,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.202.2,local_lock=none,addr=192.168.202.1)"
This looks normal except for setting fsid=17...
The best way to debug this is to open up a bugzilla report and attach a (compressed) wireshark network trace to see what is happening on the wire... The entire tar is not needed, just a good chunk...
steved.
Ok, will try doing the wireshark trace. What should I open a Bugzilla report against, the kernel ?
What is the expected sort of write performance when un-taring, for example, the linux kernel sources ? Is 2 MBytes/sec on average on a Gigabit link typical (3 mins to untar 4.14.15) or should it be better ?
On Mon, Jan 29, 2018 at 08:37:50PM +0000, Terry Barnaby wrote:
Ok, that's a shame unless NFSv4's write performance with small files/dirs is relatively ok which it isn't on my systems. Although async was "unsafe" this was not an issue in main standard scenarios such as an NFS mounted home directory only being used by one client. The async option also does not appear to work when using NFSv3. I guess it was removed from that protocol at some point as well ?
This isn't related to the NFS protocol version.
I think everybody's confusing the server-side "async" export option with the client-side mount "async" option. They're not really related.
The unsafe thing that speeds up file creates is the server-side "async" option. Sounds like you tried to use the client-side mount option instead, which wouldn't do anything.
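To be concrete, the behaviour in question is controlled on the server; a sketch only (the path and host pattern are placeholders):

# /etc/exports on the server -- "async" here lets the server reply before data reaches disk
/data  *.kingnet(rw,async,no_subtree_check)

# apply the change without restarting anything
exportfs -arv

The client-side sync/async mount flags only control when the client flushes its own dirty pages; they don't change what the server does before replying.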
What is the expected sort of write performance when un-taring, for example, the linux kernel sources ? Is 2 MBytes/sec on average on a Gigabit link typical (3 mins to untar 4.14.15) or should it be better ?
It's not bandwidth that matters, it's latency.
The file create isn't allowed to return until the server has created the file and the change has actually reached disk.
So an RPC has to reach the server, which has to wait for disk, and then the client has to get the RPC reply. Usually it's the disk latency that dominates.
And also the final close after the new file is written can't return until all the new file data has reached disk.
v4.14.15 has 61305 files:
$ git ls-tree -r v4.14.15 | wc -l
61305
So time to create each file was about 3 minutes/61305 =~ 3ms.
So assuming two roundtrips per file, your disk latency is probably about 1.5ms?
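Working that arithmetic through with the numbers above, as a quick check:

$ echo "scale=3; 180*1000/61305" | bc      # ms per file for a 3-minute untar
2.936
$ echo "scale=3; 180*1000/61305/2" | bc    # per roundtrip, assuming 2 RPCs per file
1.468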
You can improve the storage latency somehow (e.g. with a battery-backed write cache) or use more parallelism (has anyone ever tried to write a parallel untar?). Or you can cheat and set the async export option, and then the server will no longer wait for disk before replying. The problem is that on server reboot/crash, the client's assumptions about which operations succeeded may turn out to be wrong.
--b.
On 2018-01-29, J. Bruce Fields bfields@redhat.com wrote:
The file create isn't allowed to return until the server has created the file and the change has actually reached disk.
Why is there such a requirement? This is not true for local file systems. This is why fsync() exists.
-- Petr
On 29/01/18 22:28, J. Bruce Fields wrote:
On Mon, Jan 29, 2018 at 08:37:50PM +0000, Terry Barnaby wrote:
Ok, that's a shame unless NFSv4's write performance with small files/dirs is relatively ok which it isn't on my systems. Although async was "unsafe" this was not an issue in main standard scenarios such as an NFS mounted home directory only being used by one client. The async option also does not appear to work when using NFSv3. I guess it was removed from that protocol at some point as well ?
This isn't related to the NFS protocol version.
I think everybody's confusing the server-side "async" export option with the client-side mount "async" option. They're not really related.
The unsafe thing that speeds up file creates is the server-side "async" option. Sounds like you tried to use the client-side mount option instead, which wouldn't do anything.
What is the expected sort of write performance when un-taring, for example, the linux kernel sources ? Is 2 MBytes/sec on average on a Gigabit link typical (3 mins to untar 4.14.15) or should it be better ?
It's not bandwidth that matters, it's latency.
The file create isn't allowed to return until the server has created the file and the change has actually reached disk.
So an RPC has to reach the server, which has to wait for disk, and then the client has to get the RPC reply. Usually it's the disk latency that dominates.
And also the final close after the new file is written can't return until all the new file data has reached disk.
v4.14.15 has 61305 files:
$ git ls-tree -r v4.14.15 | wc -l
61305
So time to create each file was about 3 minutes/61305 =~ 3ms.
So assuming two roundtrips per file, your disk latency is probably about 1.5ms?
You can improve the storage latency somehow (e.g. with a battery-backed write cache) or use more parallelism (has anyone ever tried to write a parallel untar?). Or you can cheat and set the async export option, and then the server will no longer wait for disk before replying. The problem is that on server reboot/crash, the client's assumptions about which operations succeeded may turn out to be wrong.
--b.
Many thanks for your reply.
Yes, I understand the above (latency and the normally synchronous nature of NFS). I have async defined in the server's /etc/exports options. I later also defined it on the client side, as the async option on the server did not appear to be working and I wondered if, with ongoing changes, it had been moved there (it would make some sense for the client to define it and pass this option over to the server, as the client knows, in most cases, whether the bad aspects of async would be an issue for its usage in the situation in question).
It's a server with large disks, so SSD is not really an option. The use of async is ok for my usage (mainly /home mounted and users home files only in use by one client at a time etc etc.).
However, I have just found that async is actually working! I just did not believe it was, due to the poor write performance. Without async on the server the performance is truly abysmal. The figures I get for untarring the kernel sources (4.14.15, 895 MBytes untarred) using "rm -fr linux-4.14.15; sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)" are:
Untar on server to its local disk: 13 seconds, effective data rate: 68 MBytes/s
Untar on server over NFSv4.2 with async on server: 3 minutes, effective data rate: 4.9 MBytes/sec
Untar on server over NFSv4.2 without async on server: 2 hours 12 minutes, effective data rate: 115 kBytes/s !!
Is it really expected for NFS to be this bad these days with a reasonably typical operation and are there no other tuning parameters that can help ?
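For reference, the exact sequence timed above, wrapped up so it is easy to repeat (the /data2/tmp path and tarball name are the ones from this thread; adjust as needed):

#!/bin/sh
# untar-bench.sh -- time the kernel-source untar used in the figures above
DEST=${1:-/data2/tmp}
rm -rf "$DEST/linux-4.14.15"
sync
time ( tar -xf linux-4.14.15.tar.gz -C "$DEST"; sync )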
On Tue, Jan 30, 2018 at 08:49:27AM +0000, Terry Barnaby wrote:
On 29/01/18 22:28, J. Bruce Fields wrote:
On Mon, Jan 29, 2018 at 08:37:50PM +0000, Terry Barnaby wrote:
Ok, that's a shame unless NFSv4's write performance with small files/dirs is relatively ok which it isn't on my systems. Although async was "unsafe" this was not an issue in main standard scenarios such as an NFS mounted home directory only being used by one client. The async option also does not appear to work when using NFSv3. I guess it was removed from that protocol at some point as well ?
This isn't related to the NFS protocol version.
I think everybody's confusing the server-side "async" export option with the client-side mount "async" option. They're not really related.
The unsafe thing that speeds up file creates is the server-side "async" option. Sounds like you tried to use the client-side mount option instead, which wouldn't do anything.
What is the expected sort of write performance when un-taring, for example, the linux kernel sources ? Is 2 MBytes/sec on average on a Gigabit link typical (3 mins to untar 4.14.15) or should it be better ?
It's not bandwidth that matters, it's latency.
The file create isn't allowed to return until the server has created the file and the change has actually reached disk.
So an RPC has to reach the server, which has to wait for disk, and then the client has to get the RPC reply. Usually it's the disk latency that dominates.
And also the final close after the new file is written can't return until all the new file data has reached disk.
v4.14.15 has 61305 files:
$ git ls-tree -r v4.14.15 | wc -l
61305
So time to create each file was about 3 minutes/61305 =~ 3ms.
So assuming two roundtrips per file, your disk latency is probably about 1.5ms?
You can improve the storage latency somehow (e.g. with a battery-backed write cache) or use more parallelism (has anyone ever tried to write a parallel untar?). Or you can cheat and set the async export option, and then the server will no longer wait for disk before replying. The problem is that on server reboot/crash, the client's assumptions about which operations succeeded may turn out to be wrong.
--b.
Many thanks for your reply.
Yes, I understand the above (latency and normally synchronous nature of NFS). I have async defined in the servers /etc/exports options. I have, later, also defined it on the client side as the async option on the server did not appear to be working and I wondered if with ongoing changes it had been moved there (would make some sense for the client to define it and pass this option over to the server as it knows, in most cases, if the bad aspects of async would be an issue to its usage in the situation in question).
It's a server with large disks, so SSD is not really an option. The use of async is ok for my usage (mainly /home mounted and users home files only in use by one client at a time etc etc.).
Note it's not concurrent access that will cause problems, it's server crashes. A UPS may reduce the risk a little.
However I have just found that async is actually working! I just did not believe it was, due to the poor write performance. Without async on the server the performance is truly abysmal. The figures I get for untaring the kernel sources (4.14.15 895MBytes untared) using "rm -fr linux-4.14.15; sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)" are:
Untar on server to its local disk: 13 seconds, effective data rate: 68 MBytes/s
Untar on server over NFSv4.2 with async on server: 3 minutes, effective data rate: 4.9 MBytes/sec
Untar on server over NFSv4.2 without async on server: 2 hours 12 minutes, effective data rate: 115 kBytes/s !!
2:12 is 7920 seconds, and you've got 61305 files to write, so that's about 130ms/file. That's more than I'd expect even if you're waiting for a few seeks on each file create, so there may indeed be something wrong.
By comparison, on my little home server (Fedora, ext4, a couple of WD Black 1TB drives), with sync, that untar takes 7:44, about 8ms/file.
What's the disk configuration and what filesystem is this?
Is it really expected for NFS to be this bad these days with a reasonably typical operation and are there no other tuning parameters that can help ?
It's expected that the performance of single-threaded file creates will depend on latency, not bandwidth.
I believe high-performance servers use battery backed write caches with storage behind them that can do lots of IOPS.
(One thing I've been curious about is whether you could get better performance cheaply on this kind of workload with ext3/4 striped across a few drives and an external journal on an SSD. But when I experimented with that a few years ago I found synchronous write latency wasn't much better. I didn't investigate why not; maybe that's just the way SSDs are.)
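For anyone wanting to try that experiment, the setup is roughly the following (a sketch only; /dev/sdX1 as the SSD journal partition and /dev/md0 as the striped array are placeholders, and both devices are wiped):

# make the SSD partition a dedicated external journal device
mke2fs -O journal_dev /dev/sdX1
# create ext4 on the striped array with its journal on the SSD
mkfs.ext4 -J device=/dev/sdX1 /dev/md0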
--b.
On 30/01/18 15:09, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 08:49:27AM +0000, Terry Barnaby wrote:
On 29/01/18 22:28, J. Bruce Fields wrote:
On Mon, Jan 29, 2018 at 08:37:50PM +0000, Terry Barnaby wrote:
Ok, that's a shame unless NFSv4's write performance with small files/dirs is relatively ok which it isn't on my systems. Although async was "unsafe" this was not an issue in main standard scenarios such as an NFS mounted home directory only being used by one client. The async option also does not appear to work when using NFSv3. I guess it was removed from that protocol at some point as well ?
This isn't related to the NFS protocol version.
I think everybody's confusing the server-side "async" export option with the client-side mount "async" option. They're not really related.
The unsafe thing that speeds up file creates is the server-side "async" option. Sounds like you tried to use the client-side mount option instead, which wouldn't do anything.
What is the expected sort of write performance when un-taring, for example, the linux kernel sources ? Is 2 MBytes/sec on average on a Gigabit link typical (3 mins to untar 4.14.15) or should it be better ?
It's not bandwidth that matters, it's latency.
The file create isn't allowed to return until the server has created the file and the change has actually reached disk.
So an RPC has to reach the server, which has to wait for disk, and then the client has to get the RPC reply. Usually it's the disk latency that dominates.
And also the final close after the new file is written can't return until all the new file data has reached disk.
v4.14.15 has 61305 files:
$ git ls-tree -r v4.14.15 | wc -l
61305
So time to create each file was about 3 minutes/61305 =~ 3ms.
So assuming two roundtrips per file, your disk latency is probably about 1.5ms?
You can improve the storage latency somehow (e.g. with a battery-backed write cache) or use more parallelism (has anyone ever tried to write a parallel untar?). Or you can cheat and set the async export option, and then the server will no longer wait for disk before replying. The problem is that on server reboot/crash, the client's assumptions about which operations succeeded may turn out to be wrong.
--b.
Many thanks for your reply.
Yes, I understand the above (latency and normally synchronous nature of NFS). I have async defined in the servers /etc/exports options. I have, later, also defined it on the client side as the async option on the server did not appear to be working and I wondered if with ongoing changes it had been moved there (would make some sense for the client to define it and pass this option over to the server as it knows, in most cases, if the bad aspects of async would be an issue to its usage in the situation in question).
It's a server with large disks, so SSD is not really an option. The use of async is ok for my usage (mainly /home mounted and users home files only in use by one client at a time etc etc.).
Note it's not concurrent access that will cause problems, it's server crashes. A UPS may reduce the risk a little.
However I have just found that async is actually working! I just did not believe it was, due to the poor write performance. Without async on the server the performance is truly abysmal. The figures I get for untaring the kernel sources (4.14.15 895MBytes untared) using "rm -fr linux-4.14.15; sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)" are:
Untar on server to its local disk: 13 seconds, effective data rate: 68 MBytes/s
Untar on server over NFSv4.2 with async on server: 3 minutes, effective data rate: 4.9 MBytes/sec
Untar on server over NFSv4.2 without async on server: 2 hours 12 minutes, effective data rate: 115 kBytes/s !!
2:12 is 7920 seconds, and you've got 61305 files to write, so that's about 130ms/file. That's more than I'd expect even if you're waiting for a few seeks on each file create, so there may indeed be something wrong.
By comparison, on my little home server (Fedora, ext4, a couple of WD Black 1TB drives), with sync, that untar takes 7:44, about 8ms/file.
Ok, that is far more reasonable, so something is up on my systems :) What speed do you get with the server export set to async ?
What's the disk configuration and what filesystem is this?
Those tests above were to a single SATA Western Digital Red 3TB (WDC WD30EFRX-68EUZN0) using ext4. Most of my tests have been to software RAID1 SATA disks: Western Digital Red 2TB on one server and Western Digital RE4 2TB (WDC WD2003FYYS-02W0B1) on another quad-core Xeon server, all using ext4 and all having plenty of RAM. All on stock Fedora27 (both server and client), updated to date.
Is it really expected for NFS to be this bad these days with a reasonably typical operation and are there no other tuning parameters that can help ?
It's expected that the performance of single-threaded file creates will depend on latency, not bandwidth.
I believe high-performance servers use battery backed write caches with storage behind them that can do lots of IOPS.
(One thing I've been curious about is whether you could get better performance cheap on this kind of workload ext3/4 striped across a few drives and an external journal on SSD. But when I experimented with that a few years ago I found synchronous write latency wasn't much better. I didn't investigate why not, maybe that's just the way SSDs are.)
--b.
On Tue, Jan 30, 2018 at 03:29:41PM +0000, Terry Barnaby wrote:
On 30/01/18 15:09, J. Bruce Fields wrote:
By comparison, on my little home server (Fedora, ext4, a couple of WD Black 1TB drives), with sync, that untar takes 7:44, about 8ms/file.
Ok, that is far more reasonable, so something is up on my systems :) What speed do you get with the server export set to async ?
I tried just now and got 4m2s.
The drives probably still have to do a seek or two per create, the difference now is that we don't have to wait for one create to start the next one, so the drives can work in parallel.
So given that I'm striping across two drives, I *think* it makes sense that I'm getting about double the performance with the async export option.
But that doesn't explain the difference between async and local performance (22s when I tried the same untar directly on the server, 25s when I included a final sync in the timing). And your numbers are a complete mystery.
--b.
What's the disk configuration and what filesystem is this?
Those tests above were to a single: SATA Western Digital Red 3TB, WDC WD30EFRX-68EUZN0 using ext4. Most of my tests have been to software RAID1 SATA disks, Western Digital Red 2TB on one server and Western Digital RE4 2TB WDC WD2003FYYS-02W0B1 on another quad core Xeon server all using ext4 and all having plenty of RAM. All on stock Fedora27 (both server and client) updated to date.
Is it really expected for NFS to be this bad these days with a reasonably typical operation and are there no other tuning parameters that can help ?
It's expected that the performance of single-threaded file creates will depend on latency, not bandwidth.
I believe high-performance servers use battery backed write caches with storage behind them that can do lots of IOPS.
(One thing I've been curious about is whether you could get better performance cheap on this kind of workload ext3/4 striped across a few drives and an external journal on SSD. But when I experimented with that a few years ago I found synchronous write latency wasn't much better. I didn't investigate why not, maybe that's just the way SSDs are.)
--b.
On 30/01/18 16:22, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 03:29:41PM +0000, Terry Barnaby wrote:
On 30/01/18 15:09, J. Bruce Fields wrote:
By comparison, on my little home server (Fedora, ext4, a couple of WD Black 1TB drives), with sync, that untar takes 7:44, about 8ms/file.
Ok, that is far more reasonable, so something is up on my systems :) What speed do you get with the server export set to async ?
I tried just now and got 4m2s.
The drives probably still have to do a seek or two per create, the difference now is that we don't have to wait for one create to start the next one, so the drives can work in parallel.
So given that I'm striping across two drives, I *think* it makes sense that I'm getting about double the performance with the async export option.
But that doesn't explain the difference between async and local performance (22s when I tried the same untar directly on the server, 25s when I included a final sync in the timing). And your numbers are a complete mystery.
I have just tried running the untar on our work systems. These are again Fedora27, but newer hardware. I set one of the server's NFS exports to just rw (removed the async option in /etc/exports and ran exportfs -arv), remounted this NFS file system on a Fedora27 client and re-ran the test. I have only waited 10 mins, but the overall network data rate is in the order of 0.1 MBytes/sec, so it looks like it will be a multiple-hour job, as at home. So I have two completely separate systems with the same performance over NFS. With your NFS "sync" test, are you sure you set the "sync" mode on the server and re-exported the file systems ?
--b.
What's the disk configuration and what filesystem is this?
Those tests above were to a single: SATA Western Digital Red 3TB, WDC WD30EFRX-68EUZN0 using ext4. Most of my tests have been to software RAID1 SATA disks, Western Digital Red 2TB on one server and Western Digital RE4 2TB WDC WD2003FYYS-02W0B1 on another quad core Xeon server all using ext4 and all having plenty of RAM. All on stock Fedora27 (both server and client) updated to date.
Is it really expected for NFS to be this bad these days with a reasonably typical operation and are there no other tuning parameters that can help ?
It's expected that the performance of single-threaded file creates will depend on latency, not bandwidth.
I believe high-performance servers use battery backed write caches with storage behind them that can do lots of IOPS.
(One thing I've been curious about is whether you could get better performance cheap on this kind of workload ext3/4 striped across a few drives and an external journal on SSD. But when I experimented with that a few years ago I found synchronous write latency wasn't much better. I didn't investigate why not, maybe that's just the way SSDs are.)
--b.
On Tue, Jan 30, 2018 at 04:49:41PM +0000, Terry Barnaby wrote:
I have just tried running the untar on our work systems. These are again Fedora27 but newer hardware. I set one of the servers NFS exports to just rw (removed the async option in /etc/exports and ran exportfs -arv). Remounted this NFS file system on a Fedora27 client and re-ran the test. I have only waited 10mins but the overal network data rate is in the order of 0.1 MBytes/sec so it looks like it will be a multiple hour job as at home. So I have two completely separate systems with the same performance over NFS. With your NFS "sync" test are you sure you set the "sync" mode on the server and re-exported the file systems ?
Not being a daredevil, I use "sync" by default:
# exportfs -v
/export <world>(rw,sync,wdelay,hide,no_subtree_check,sec=sys,insecure,no_root_squash,no_all_squash)
For the "async" case I changed the options and actually rebooted, yes.
The filesystem is:
/dev/mapper/export-export on /export type ext4 (rw,relatime,seclabel,nodelalloc,stripe=32,data=journal)
(I think data=journal is the only non-default, and I don't remember why I chose that.)
--b.
On Tue, Jan 30, 2018 at 12:31:22PM -0500, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 04:49:41PM +0000, Terry Barnaby wrote:
I have just tried running the untar on our work systems. These are again Fedora27 but newer hardware. I set one of the servers NFS exports to just rw (removed the async option in /etc/exports and ran exportfs -arv). Remounted this NFS file system on a Fedora27 client and re-ran the test. I have only waited 10mins but the overal network data rate is in the order of 0.1 MBytes/sec so it looks like it will be a multiple hour job as at home. So I have two completely separate systems with the same performance over NFS. With your NFS "sync" test are you sure you set the "sync" mode on the server and re-exported the file systems ?
Not being a daredevil, I use "sync" by default:
# exportfs -v
/export <world>(rw,sync,wdelay,hide,no_subtree_check,sec=sys,insecure,no_root_squash,no_all_squash)
For the "async" case I changed the options and actually rebooted, yes.
The filesystem is:
/dev/mapper/export-export on /export type ext4 (rw,relatime,seclabel,nodelalloc,stripe=32,data=journal)
(I think data=journal is the only non-default, and I don't remember why I chose that.)
Hah, well, with data=ordered (the default) the same untar (with "sync" export) took 15m38s. So... that probably wasn't an accident.
It may be irresponsible for me to guess given the state of my ignorance about ext4 journaling, but perhaps writing everything to the journal and delaying writing it out to its real location as long as possible allows some sort of tradeoff between bandwidth and seeks that helps with this sync-heavy workload.
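If anyone wants to compare, data=journal is just a mount option; a sketch using the device and mount point from above (note that ext4 disables delayed allocation in this mode, hence the nodelalloc in the mount output):

# mount the export with full data journalling
mount -o data=journal /dev/mapper/export-export /export

# or set it as a default mount option in the superblock
tune2fs -o journal_data /dev/mapper/export-export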
--b.
On 30/01/18 17:54, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 12:31:22PM -0500, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 04:49:41PM +0000, Terry Barnaby wrote:
I have just tried running the untar on our work systems. These are again Fedora27 but newer hardware. I set one of the servers NFS exports to just rw (removed the async option in /etc/exports and ran exportfs -arv). Remounted this NFS file system on a Fedora27 client and re-ran the test. I have only waited 10mins but the overal network data rate is in the order of 0.1 MBytes/sec so it looks like it will be a multiple hour job as at home. So I have two completely separate systems with the same performance over NFS. With your NFS "sync" test are you sure you set the "sync" mode on the server and re-exported the file systems ?
Not being a daredevil, I use "sync" by default:
# exportfs -v
/export <world>(rw,sync,wdelay,hide,no_subtree_check,sec=sys,insecure,no_root_squash,no_all_squash)
For the "async" case I changed the options and actually rebooted, yes.
The filesystem is:
/dev/mapper/export-export on /export type ext4 (rw,relatime,seclabel,nodelalloc,stripe=32,data=journal)
(I think data=journal is the only non-default, and I don't remember why I chose that.)
Hah, well, with data=ordered (the default) the same untar (with "sync" export) took 15m38s. So... that probably wasn't an accident.
It may be irresponsible for me to guess given the state of my ignorance about ext4 journaling, but perhaps writing everything to the journal and delaying writing it out to its real location as long as possible allows some sort of tradeoff between bandwidth and seeks that helps with this sync-heavy workload.
--b.
Being a daredevil, I have used the NFS async option for 27 years without an issue on multiple systems :)
I have just mounted my ext4 disk with the same options you were using and the same NFS export options, and the speed here looks the same as I had previously. As I can't wait 2+ hours, I'm just looking at ksysguard, which is showing a network rate of about 10 KBytes/s, and the directory on the server is growing in size very, very slowly.
This is using the current Fedora27 kernel 4.14.14-300.fc27.x86_64.
I will have a look at using wireshark to see if this shows anything.
Being a daredevil, I have used the NFS async option for 27 years without an issue on multiple systems :)
I have just mounted my ext4 disk with the same options you were using and the same NFS export options and the speed here looks the same as I had previously. As I can't wait 2+ hours so I'm just looking at ksysguard and it is showing a network rate of about 10 KBytes/s and the directory on the server is growing in size very very slowly.
This is using the current Fedora27 kernel 4.14.14-300.fc27.x86_64.
I will have a look at using wireshark to see if this shows anything.
This is a snippet from a wireshark trace of the NFS when untaring the linux kernel 4.14.15 sources into an NFSv4.2 mounted directory with "sync" option on my NFS server. The whole untar would take > 2 hours vs 13 seconds direct to the disk. This is about 850 MBytes of 60k files. The following is a single, small file write.
No.   Time          Source         Destination    Protocol Length Info
1880  11.928600315  192.168.202.2  192.168.202.1  NFS  380  V4 Call (Reply In 1881) OPEN DH: 0xac0502f2/sysfs-c2port
1881  11.950329198  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 1880) OPEN StateID: 0xaa72
1882  11.950446430  192.168.202.2  192.168.202.1  NFS  304  V4 Call (Reply In 1883) SETATTR FH: 0x825014ee
1883  11.972608880  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 1882) SETATTR
1884  11.972754709  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=465561 Ack=183381 Win=8990 Len=1448 TSval=1663691771 TSecr=3103357902 [TCP segment of a reassembled PDU]
1885  11.972763078  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=467009 Ack=183381 Win=8990 Len=1448 TSval=1663691771 TSecr=3103357902 [TCP segment of a reassembled PDU]
1886  11.972979437  192.168.202.2  192.168.202.1  NFS  332  V4 Call (Reply In 1888) WRITE StateID: 0xafdf Offset: 0 Len: 2931
1887  11.973074490  192.168.202.1  192.168.202.2  TCP  68   2049 → 785 [ACK] Seq=183381 Ack=468721 Win=24557 Len=0 TSval=3103357902 TSecr=1663691771
1888  12.017153631  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 1886) WRITE
1889  12.017338766  192.168.202.2  192.168.202.1  NFS  260  V4 Call (Reply In 1890) GETATTR FH: 0x825014ee
1890  12.017834411  192.168.202.1  192.168.202.2  NFS  312  V4 Reply (Call In 1889) GETATTR
1891  12.017961690  192.168.202.2  192.168.202.1  NFS  328  V4 Call (Reply In 1892) SETATTR FH: 0x825014ee
1892  12.039456634  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 1891) SETATTR
1893  12.039536705  192.168.202.2  192.168.202.1  NFS  284  V4 Call (Reply In 1894) CLOSE StateID: 0xaa72
1894  12.039979528  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 1893) CLOSE
1895  12.040077180  192.168.202.2  192.168.202.1  NFS  392  V4 Call (Reply In 1896) OPEN DH: 0xac0502f2/sysfs-cfq-target-latency
1896  12.061903798  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 1895) OPEN StateID: 0xaa72
It looks like it takes about 100ms to write this small file. With the approx 60k files in the archive this would take about 6000 secs, so it is in the 2-hour ballpark of the untar that I am seeing.
Looks like OPEN 21ms, SETATTR 22ms, WRITE 44ms, second SETATTR 21ms: a lot of time ...
The following is for an "async" mount:
No.    Time         Source         Destination    Protocol Length Info
37393  7.630012608  192.168.202.2  192.168.202.1  NFS  396  V4 Call (Reply In 37394) OPEN DH: 0x1f828ac9/vidioc-dbg-g-chip-info.rst
37394  7.630488451  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 37393) OPEN StateID: 0xaa72
37395  7.630525117  192.168.202.2  192.168.202.1  NFS  304  V4 Call (Reply In 37396) SETATTR FH: 0x0f65c554
37396  7.630980560  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 37395) SETATTR
37397  7.631035171  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13054241 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37398  7.631038994  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13055689 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37399  7.631042228  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13057137 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37400  7.631195554  192.168.202.2  192.168.202.1  NFS  448  V4 Call (Reply In 37402) WRITE StateID: 0xafdf Offset: 0 Len: 4493
37401  7.631277423  192.168.202.1  192.168.202.2  TCP  68   2049 → 785 [ACK] Seq=3620329 Ack=13058965 Win=24550 Len=0 TSval=3104261712 TSecr=1664595527
37402  7.631506418  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 37400) WRITE
37403  7.631529718  192.168.202.2  192.168.202.1  NFS  260  V4 Call (Reply In 37404) GETATTR FH: 0x0f65c554
37404  7.631946710  192.168.202.1  192.168.202.2  NFS  312  V4 Reply (Call In 37403) GETATTR
37405  7.631982683  192.168.202.2  192.168.202.1  NFS  328  V4 Call (Reply In 37406) SETATTR FH: 0x0f65c554
37406  7.632423600  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 37405) SETATTR
37407  7.632461397  192.168.202.2  192.168.202.1  NFS  284  V4 Call (Reply In 37408) CLOSE StateID: 0xaa72
37408  7.632880138  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 37407) CLOSE
37409  7.632926994  192.168.202.2  192.168.202.1  NFS  396  V4 Call (Reply In 37410) OPEN DH: 0x1f828ac9/vidioc-dbg-g-register.rst
37410  7.633470097  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 37409) OPEN StateID: 0xaa72
It looks like it takes about 3ms to write this small file. With the approx 60k files in the archive this would take about 180 secs, so it is in the 3-minute ballpark that I am seeing.
It looks like each RPC call takes about 0.5ms. Why do there need to be so many RPC calls for this ? The OPEN call could set the attribs, with no need for the later GETATTR or SETATTR calls. Even the CLOSE could be integrated with the WRITE, and, taking this further, OPEN could do OPEN, SETATTR, and some WRITE all in one.
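One way to get these per-operation times without reading the trace by hand is wireshark/tshark's ONC-RPC service response time statistics (a sketch; the capture file name is a placeholder):

# average response time per NFS operation (program 100003, version 4) in a capture
tshark -q -r nfs-untar.pcap -z rpc,srt,100003,4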
Hi,
On 01/30/2018 01:03 PM, Terry Barnaby wrote:
Being a daredevil, I have used the NFS async option for 27 years without an issue on multiple systems :)
I have just mounted my ext4 disk with the same options you were using and the same NFS export options and the speed here looks the same as I had previously. As I can't wait 2+ hours so I'm just looking at ksysguard and it is showing a network rate of about 10 KBytes/s and the directory on the server is growing in size very very slowly.
This is using the current Fedora27 kernel 4.14.14-300.fc27.x86_64.
I will have a look at using wireshark to see if this shows anything.
This is a snippet from a wireshark trace of the NFS when untaring the linux kernel 4.14.15 sources into an NFSv4.2 mounted directory with "sync" option on my NFS server. The whole untar would take > 2 hours vs 13 seconds direct to the disk. This is about 850 MBytes of 60k files. The following is a single, small file write.
No.   Time          Source         Destination    Protocol Length Info
1880  11.928600315  192.168.202.2  192.168.202.1  NFS  380  V4 Call (Reply In 1881) OPEN DH: 0xac0502f2/sysfs-c2port
1881  11.950329198  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 1880) OPEN StateID: 0xaa72
1882  11.950446430  192.168.202.2  192.168.202.1  NFS  304  V4 Call (Reply In 1883) SETATTR FH: 0x825014ee
1883  11.972608880  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 1882) SETATTR
1884  11.972754709  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=465561 Ack=183381 Win=8990 Len=1448 TSval=1663691771 TSecr=3103357902 [TCP segment of a reassembled PDU]
1885  11.972763078  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=467009 Ack=183381 Win=8990 Len=1448 TSval=1663691771 TSecr=3103357902 [TCP segment of a reassembled PDU]
1886  11.972979437  192.168.202.2  192.168.202.1  NFS  332  V4 Call (Reply In 1888) WRITE StateID: 0xafdf Offset: 0 Len: 2931
1887  11.973074490  192.168.202.1  192.168.202.2  TCP  68   2049 → 785 [ACK] Seq=183381 Ack=468721 Win=24557 Len=0 TSval=3103357902 TSecr=1663691771
1888  12.017153631  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 1886) WRITE
1889  12.017338766  192.168.202.2  192.168.202.1  NFS  260  V4 Call (Reply In 1890) GETATTR FH: 0x825014ee
1890  12.017834411  192.168.202.1  192.168.202.2  NFS  312  V4 Reply (Call In 1889) GETATTR
1891  12.017961690  192.168.202.2  192.168.202.1  NFS  328  V4 Call (Reply In 1892) SETATTR FH: 0x825014ee
1892  12.039456634  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 1891) SETATTR
1893  12.039536705  192.168.202.2  192.168.202.1  NFS  284  V4 Call (Reply In 1894) CLOSE StateID: 0xaa72
1894  12.039979528  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 1893) CLOSE
1895  12.040077180  192.168.202.2  192.168.202.1  NFS  392  V4 Call (Reply In 1896) OPEN DH: 0xac0502f2/sysfs-cfq-target-latency
1896  12.061903798  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 1895) OPEN StateID: 0xaa72
It looks like this takes about 100ms to write this small file. With the approx 60k files in the archive this would take about 6000 secs, so is in the 2 hours ballpark or the untar that I am seeing.
Looks like OPEN 21ms, SETATTR 22ms, WRITE 44ms, second SETATTR 21ms a lot of time ...
The following is for an "async" mount:
No.    Time         Source         Destination    Protocol Length Info
37393  7.630012608  192.168.202.2  192.168.202.1  NFS  396  V4 Call (Reply In 37394) OPEN DH: 0x1f828ac9/vidioc-dbg-g-chip-info.rst
37394  7.630488451  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 37393) OPEN StateID: 0xaa72
37395  7.630525117  192.168.202.2  192.168.202.1  NFS  304  V4 Call (Reply In 37396) SETATTR FH: 0x0f65c554
37396  7.630980560  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 37395) SETATTR
37397  7.631035171  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13054241 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37398  7.631038994  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13055689 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37399  7.631042228  192.168.202.2  192.168.202.1  TCP  1516 785 → 2049 [ACK] Seq=13057137 Ack=3620329 Win=8990 Len=1448 TSval=1664595527 TSecr=3104261711 [TCP segment of a reassembled PDU]
37400  7.631195554  192.168.202.2  192.168.202.1  NFS  448  V4 Call (Reply In 37402) WRITE StateID: 0xafdf Offset: 0 Len: 4493
37401  7.631277423  192.168.202.1  192.168.202.2  TCP  68   2049 → 785 [ACK] Seq=3620329 Ack=13058965 Win=24550 Len=0 TSval=3104261712 TSecr=1664595527
37402  7.631506418  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 37400) WRITE
37403  7.631529718  192.168.202.2  192.168.202.1  NFS  260  V4 Call (Reply In 37404) GETATTR FH: 0x0f65c554
37404  7.631946710  192.168.202.1  192.168.202.2  NFS  312  V4 Reply (Call In 37403) GETATTR
37405  7.631982683  192.168.202.2  192.168.202.1  NFS  328  V4 Call (Reply In 37406) SETATTR FH: 0x0f65c554
37406  7.632423600  192.168.202.1  192.168.202.2  NFS  336  V4 Reply (Call In 37405) SETATTR
37407  7.632461397  192.168.202.2  192.168.202.1  NFS  284  V4 Call (Reply In 37408) CLOSE StateID: 0xaa72
37408  7.632880138  192.168.202.1  192.168.202.2  NFS  248  V4 Reply (Call In 37407) CLOSE
37409  7.632926994  192.168.202.2  192.168.202.1  NFS  396  V4 Call (Reply In 37410) OPEN DH: 0x1f828ac9/vidioc-dbg-g-register.rst
37410  7.633470097  192.168.202.1  192.168.202.2  NFS  408  V4 Reply (Call In 37409) OPEN StateID: 0xaa72
It looks like this takes about 3ms to write this small file. With the approx 60k files in the archive this would take about 180 secs, so is in the 3 minutes ballpark that I am seeing.
It looks like each RPC call takes about 0.5ms. Why do there need to be some many RPC calls for this ? The OPEN call could set the attribs, no need for the later GETATTR or SETATTR calls. Even the CLOSE could be integrated with the WRITE and taking this further OPEN could do OPEN, SETATTR, and some WRITE all in one.
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large decrease in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options filesystem tuning/etc, I got to this point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
AKA a nfsv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet, was slower than a modern machine with a 8 disk xfs RAID5 on 10Gb on nfsv4. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
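For reference, forcing the older protocol for a comparison is just a mount option (server name and paths here are placeholders):

# one-off test mount forced to NFSv3
mount -t nfs -o nfsvers=3,proto=tcp king.kingnet:/data /mnt/test

# or the /etc/fstab form
king.kingnet:/data  /data  nfs  nfsvers=3  0 0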
On Tue, Jan 30, 2018 at 01:52:49PM -0600, Jeremy Linton wrote:
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large decrease in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options filesystem tuning/etc, I got to this point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
AKA a nfsv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet, was slower than a modern machine with a 8 disk xfs RAID5 on 10Gb on nfsv4. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
Did you mean "faster than"?
Definitely worth trying, though I wouldn't expect it to make that big a difference in the untarring-a-kernel-tree case--I think the only RPC avoided in the v3 case would be the CLOSE, and it should be one of the faster ones.
In the kernel compile case there's probably also a lot of re-opening and re-reading files too? NFSv4 is chattier there too. Read delegations should help compensate, but we need to improve the heuristics that decide when they're given out.
All that aside I can't think what would explain that big a difference (45 minutes vs. 5). It might be interesting to figure out what happened.
--b.
On 01/31/2018 09:49 AM, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 01:52:49PM -0600, Jeremy Linton wrote:
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large decrease in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options filesystem tuning/etc, I got to this point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
AKA a nfsv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet, was slower than a modern machine with a 8 disk xfs RAID5 on 10Gb on nfsv4. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
Did you mean "faster than"?
Yes, sorry about that.
Definitely worth trying, though I wouldn't expect it to make that big a difference in the untarring-a-kernel-tree case--I think the only RPC avoided in the v3 case would be the CLOSE, and it should be one of the faster ones.
In the kernel compile case there's probably also a lot of re-opening and re-reading files too? NFSv4 is chattier there too. Read delegations should help compensate, but we need to improve the heuristics that decide when they're given out.
The main kernel include files get repeatedly hammered, despite them in theory being in cache, IIRC. So yes, if the concurrent (re)open path is even slightly slower it's going to hurt a lot.
All that aside I can't think what would explain that big a difference (45 minutes vs. 5). It might be interesting to figure out what happened.
I had already spent more than my allotted time looking in the wrong direction at the filesystem/RAID (did turn off intellipark though) by the time I discovered the nfsv3/v4 perf delta. It's been sitting way down on the "things to look into" list for a long time now. I'm still using it as an NFS server, so at some point I can take another look if the problem persists.
On 01/02/18 01:34, Jeremy Linton wrote:
On 01/31/2018 09:49 AM, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 01:52:49PM -0600, Jeremy Linton wrote:
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large decrease in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options filesystem tuning/etc, I got to this point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
AKA a nfsv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet, was slower than a modern machine with a 8 disk xfs RAID5 on 10Gb on nfsv4. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
Using NFSv3 in async mode is faster than NFSv4 in async mode (still abysmal in sync mode).
NFSv3 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    2m25.717s
user    0m8.739s
sys     0m13.362s
NFSv4 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    3m33.032s
user    0m8.506s
sys     0m16.930s
NFSv3 async: wireshark trace
No.    Time         Source         Destination    Protocol Length Info
18527  2.815884979  192.168.202.2  192.168.202.1  NFS  216  V3 CREATE Call (Reply In 18528), DH: 0x62f39428/dma.h Mode: EXCLUSIVE
18528  2.816362338  192.168.202.1  192.168.202.2  NFS  328  V3 CREATE Reply (Call In 18527)
18529  2.816418841  192.168.202.2  192.168.202.1  NFS  224  V3 SETATTR Call (Reply In 18530), FH: 0x13678ba0
18530  2.816871820  192.168.202.1  192.168.202.2  NFS  216  V3 SETATTR Reply (Call In 18529)
18531  2.816966771  192.168.202.2  192.168.202.1  NFS  1148 V3 WRITE Call (Reply In 18532), FH: 0x13678ba0 Offset: 0 Len: 934 FILE_SYNC
18532  2.817441291  192.168.202.1  192.168.202.2  NFS  208  V3 WRITE Reply (Call In 18531) Len: 934 FILE_SYNC
18533  2.817495775  192.168.202.2  192.168.202.1  NFS  236  V3 SETATTR Call (Reply In 18534), FH: 0x13678ba0
18534  2.817920346  192.168.202.1  192.168.202.2  NFS  216  V3 SETATTR Reply (Call In 18533)
18535  2.818002910  192.168.202.2  192.168.202.1  NFS  216  V3 CREATE Call (Reply In 18536), DH: 0x62f39428/elf.h Mode: EXCLUSIVE
18536  2.818492126  192.168.202.1  192.168.202.2  NFS  328  V3 CREATE Reply (Call In 18535)
This is taking about 2ms for a small file write, rather than 3ms for NFSv4. NFSv4 has an extra GETATTR and CLOSE RPC, accounting for the difference.
So where I am:
1. NFS in sync mode, at least on my two Fedora27 systems for my usage, is completely unusable (sync: 2 hours, async: 3 minutes, local disk: 13 seconds).
2. NFS async mode is working, but the small writes are still very slow.
3. NFS in async mode is 30% better with NFSv3 than NFSv4 when writing small files due to the increased latency caused by NFSv4's two extra RPC calls.
I really think that in 2018 we should be able to have better NFS performance when writing many small files, such as in software development. This would dramatically speed up any system using NFS with this sort of workload and reduce power usage, all for some improvements in the NFS protocol.
I don't know the details of whether this would work, or who is responsible for NFS, but it would be good, if possible, to have some improvements (NFSv4.3 ?). Maybe:
1. Have an OPEN-SETATTR-WRITE RPC call all in one and a SETATTR-CLOSE call all in one. This would reduce the latency of a small file to 1ms rather than 3ms, thus 66% faster. It would require the client to delay the OPEN/SETATTR until the first WRITE. Not sure how possible this is in the implementations. Maybe READs could be improved as well, but getting the OPEN through quickly may be better in this case ?
2. Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call. (0.5ms vs 3ms).
3. On sync/async modes personally I think it would be better for the client to request the mount in sync/async mode. The setting of sync on the server side would just enforce sync mode for all clients. If the server is in the default async mode clients can mount using sync or async as to their requirements. This seems to match normal VFS semantics and usage patterns better.
4. The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms). Maybe this is worth investigating in the Linux kernel processing (how ? see the measurement sketch after this list) ?
5. The 20ms RPC latency I see in sync mode needs a look on my system, although async mode is fine for my usage. Maybe this ends up as 2 x 10ms drive seeks on ext4 and is thus expected.
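On point 4, the per-operation RPC round-trip times can also be read off the client without wireshark; a sketch, with the mount point as a placeholder:

# per-op counts, average RTT and execute times for an NFS mount (from nfs-utils)
mountstats /data

# the raw counters the tool parses
cat /proc/self/mountstats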
On 01/02/18 08:29, Terry Barnaby wrote:
On 01/02/18 01:34, Jeremy Linton wrote:
On 01/31/2018 09:49 AM, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 01:52:49PM -0600, Jeremy Linton wrote:
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large decrease in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options filesystem tuning/etc, I got to this point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
AKA a nfsv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet, was slower than a modern machine with a 8 disk xfs RAID5 on 10Gb on nfsv4. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
Using NFSv3 in async mode is faster than NFSv4 in async mode (still abysmal in sync mode).
NFSv3 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    2m25.717s
user    0m8.739s
sys     0m13.362s
NFSv4 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    3m33.032s
user    0m8.506s
sys     0m16.930s
NFSv3 async: wireshark trace
No.    Time         Source         Destination    Protocol Length Info
18527  2.815884979  192.168.202.2  192.168.202.1  NFS  216  V3 CREATE Call (Reply In 18528), DH: 0x62f39428/dma.h Mode: EXCLUSIVE
18528  2.816362338  192.168.202.1  192.168.202.2  NFS  328  V3 CREATE Reply (Call In 18527)
18529  2.816418841  192.168.202.2  192.168.202.1  NFS  224  V3 SETATTR Call (Reply In 18530), FH: 0x13678ba0
18530  2.816871820  192.168.202.1  192.168.202.2  NFS  216  V3 SETATTR Reply (Call In 18529)
18531  2.816966771  192.168.202.2  192.168.202.1  NFS  1148 V3 WRITE Call (Reply In 18532), FH: 0x13678ba0 Offset: 0 Len: 934 FILE_SYNC
18532  2.817441291  192.168.202.1  192.168.202.2  NFS  208  V3 WRITE Reply (Call In 18531) Len: 934 FILE_SYNC
18533  2.817495775  192.168.202.2  192.168.202.1  NFS  236  V3 SETATTR Call (Reply In 18534), FH: 0x13678ba0
18534  2.817920346  192.168.202.1  192.168.202.2  NFS  216  V3 SETATTR Reply (Call In 18533)
18535  2.818002910  192.168.202.2  192.168.202.1  NFS  216  V3 CREATE Call (Reply In 18536), DH: 0x62f39428/elf.h Mode: EXCLUSIVE
18536  2.818492126  192.168.202.1  192.168.202.2  NFS  328  V3 CREATE Reply (Call In 18535)
This is taking about 2ms for a small file write rather than 3ms for NFSv4. There is an extra GETATTR and CLOSE RPC in NFSv4 accounting for the difference.
So where I am:
1. NFS in sync mode, at least on my two Fedora27 systems for my usage, is completely unusable (sync: 2 hours, async: 3 minutes, local disk: 13 seconds).
2. NFS async mode is working, but the small writes are still very slow.
3. NFS in async mode is 30% better with NFSv3 than NFSv4 when writing small files due to the increased latency caused by NFSv4's two extra RPC calls.
I really think that in 2018 we should be able to have better NFS performance when writing many small files such as used in software development. This would speed up any system that was using NFS with this sort of workload dramatically and reduce power usage all for some improvements in the NFS protocol.
I don't know the details of if this would work, or who is responsible for NFS, but it would be good if possible to have some improvements (NFSv4.3 ?). Maybe:
1. Have an OPEN-SETATTR-WRITE RPC call all in one and a SETATTR-CLOSE call all in one. This would reduce the latency of a small file to 1ms rather than 3ms, thus 66% faster. It would require the client to delay the OPEN/SETATTR until the first WRITE. Not sure how possible this is in the implementations. Maybe READs could be improved as well, but getting the OPEN through quickly may be better in this case ?
2. Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call. (0.5ms vs 3ms).
3. On sync/async modes personally I think it would be better for the client to request the mount in sync/async mode. The setting of sync on the server side would just enforce sync mode for all clients. If the server is in the default async mode clients can mount using sync or async as to their requirements. This seems to match normal VFS semantics and usage patterns better.
4. The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms). Maybe this is worth investigating in the Linux kernel processing (how ?) ?
5. The 20ms RPC latency I see in sync mode needs a look on my system, although async mode is fine for my usage. Maybe this ends up as 2 x 10ms drive seeks on ext4 and is thus expected.
Yet another poor NFS performance issue. If I do an "ls -lR" of a certain NFS-mounted directory over a slow link (NFS over OpenVPN over FTTP 80/20Mbps), just after mounting the file system (default NFSv4 mount with async), it takes about 9 seconds. If I run the same "ls -lR" again, just after, it takes about 60 seconds. So much for caching ! I have noticed Makefile-based builds (over Ethernet 1Gbps) taking a long time, with a second or so between each directory; I think this may be why.
Listing the directory using an NFSv3 mount takes 67 seconds on the first mount and about the same on subsequent ones. No noticeable caching (default mount options with async). At least NFSv4 is fast the first time !
NFSv4 directory reads after mount:
No.  Time         Source         Destination    Protocol Length Info
667  4.560833210  192.168.202.2  192.168.201.1  NFS  304  V4 Call (Reply In 672) READDIR FH: 0xde55a546
668  4.582809439  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=304477 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
669  4.582986377  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=305814 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
670  4.583003805  192.168.202.2  192.168.201.1  TCP  68   679 → 2049 [ACK] Seq=45901 Ack=307151 Win=1444 Len=0 TSval=913651376 TSecr=2646321616
671  4.583265423  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=307151 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
672  4.583280603  192.168.201.1  192.168.202.2  NFS  289  V4 Reply (Call In 667) READDIR
673  4.583291818  192.168.202.2  192.168.201.1  TCP  68   679 → 2049 [ACK] Seq=45901 Ack=308709 Win=1444 Len=0 TSval=913651377 TSecr=2646321616
674  4.583819172  192.168.202.2  192.168.201.1  NFS  280  V4 Call (Reply In 675) GETATTR FH: 0xb91bfde7
675  4.605389953  192.168.201.1  192.168.202.2  NFS  312  V4 Reply (Call In 674) GETATTR
676  4.605491075  192.168.202.2  192.168.201.1  NFS  288  V4 Call (Reply In 677) ACCESS FH: 0xb91bfde7, [Check: RD LU MD XT DL]
677  4.626848306  192.168.201.1  192.168.202.2  NFS  240  V4 Reply (Call In 676) ACCESS, [Allowed: RD LU MD XT DL]
678  4.626993773  192.168.202.2  192.168.201.1  NFS  304  V4 Call (Reply In 679) READDIR FH: 0xb91bfde7
679  4.649330354  192.168.201.1  192.168.202.2  NFS  2408 V4 Reply (Call In 678) READDIR
680  4.649380840  192.168.202.2  192.168.201.1  TCP  68   679 → 2049 [ACK] Seq=46569 Ack=311465 Win=1444 Len=0 TSval=913651443 TSecr=2646321683
681  4.649716746  192.168.202.2  192.168.201.1  NFS  280  V4 Call (Reply In 682) GETATTR FH: 0xb6d01f2a
682  4.671167708  192.168.201.1  192.168.202.2  NFS  312  V4 Reply (Call In 681) GETATTR
683  4.671281003  192.168.202.2  192.168.201.1  NFS  288  V4 Call (Reply In 684) ACCESS FH: 0xb6d01f2a, [Check: RD LU MD XT DL]
684  4.692647455  192.168.201.1  192.168.202.2  NFS  240  V4 Reply (Call In 683) ACCESS, [Allowed: RD LU MD XT DL]
685  4.692825251  192.168.202.2  192.168.201.1  NFS  304  V4 Call (Reply In 690) READDIR FH: 0xb6d01f2a
686  4.715060586  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=311881 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
687  4.715199557  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=313218 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
688  4.715215055  192.168.202.2  192.168.201.1  TCP  68   679 → 2049 [ACK] Seq=47237 Ack=314555 Win=1444 Len=0 TSval=913651509 TSecr=2646321748
689  4.715524465  192.168.201.1  192.168.202.2  TCP  1405 2049 → 679 [ACK] Seq=314555 Ack=47237 Win=1452 Len=1337 TSval=2646321749 TSecr=913651486 [TCP segment of a reassembled PDU]
690  4.715911571  192.168.201.1  192.168.202.2  NFS  1449 V4 Reply (Call In 685) READDIR
NFS directory reads later:
No. Time Source Destination Protocol Length Info
664 9.485593049 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 669) READDIR FH: 0x1933e99e
665 9.507596250 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=127921 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
666 9.507717425 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=129258 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
667 9.507733352 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=130595 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
668 9.507987020 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=130595 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
669 9.508456847 192.168.201.1 192.168.202.2 NFS 989 V4 Reply (Call In 664) READDIR
670 9.508472149 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=132853 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
671 9.508880627 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 672) GETATTR FH: 0x7e9e8300
672 9.530375865 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 671) GETATTR
673 9.530564317 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 674) GETATTR FH: 0xcb837ac9
674 9.551906321 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 673) GETATTR
675 9.552064038 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 676) GETATTR FH: 0xbf951d32
676 9.574210528 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 675) GETATTR
677 9.574334117 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 678) GETATTR FH: 0xd3f3dc3e
678 9.595902902 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 677) GETATTR
679 9.596025484 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 680) GETATTR FH: 0xf534332a
680 9.617497794 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 679) GETATTR
681 9.617621218 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 682) GETATTR FH: 0xa7e5bbc5
682 9.639157371 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 681) GETATTR
683 9.639279098 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 684) GETATTR FH: 0xa8050515
684 9.660669335 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 683) GETATTR
685 9.660787725 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 686) READDIR FH: 0x7e9e8300
686 9.682612756 192.168.201.1 192.168.202.2 NFS 1472 V4 Reply (Call In 685) READDIR
687 9.682646761 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=67450 Ack=135965 Win=1444 Len=0 TSval=913106513 TSecr=2645776747
688 9.682906293 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 689) GETATTR FH: 0xa8050515
Lots of GETATTR calls the second time around (each file ?).
NFS really is broken performance-wise these days, and it "appears" that significant/huge improvements are possible.
Anyone know what group/who is responsible for NFS protocol these days ?
Also what group/who is responsible for the Linux kernel's implementation of it ?
http://vger.kernel.org/vger-lists.html#linux-nfs
On Mon, Feb 05, 2018 at 08:21:06AM +0000, Terry Barnaby wrote:
On 01/02/18 08:29, Terry Barnaby wrote:
On 01/02/18 01:34, Jeremy Linton wrote:
On 01/31/2018 09:49 AM, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 01:52:49PM -0600, Jeremy Linton wrote:
Have you tried this with a '-o nfsvers=3' during mount? Did that help?
I noticed a large increase in my kernel build times across NFS/lan a while back after a machine/kernel/10g upgrade. After playing with mount/export options, filesystem tuning etc., I got to the point of timing a bunch of these operations vs the older machine, at which point I discovered that simply backing down to NFSv3 solved the problem.
I.e. a modern machine with an 8 disk xfs RAID5 on 10Gb running NFSv4 was slower than an NFSv3 server on a 10 year old 4 disk xfs RAID5 on 1Gb ethernet. The effect was enough to change a kernel build from ~45 minutes down to less than 5.
Using NFSv3 in async mode is faster than NFSv4 in async mode (still abysmal in sync mode).
NFSv3 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    2m25.717s
user    0m8.739s
sys     0m13.362s
NFSv4 async: sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
real    3m33.032s
user    0m8.506s
sys     0m16.930s
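For anyone wanting to reproduce the comparison, a minimal sketch of the two runs (the server name and mount point here are hypothetical; the client-side async behaviour is the default):

  # NFSv4.2 (the Fedora 27 default) against the async export
  mount -t nfs -o vers=4.2 server:/data2 /data2
  sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)
  umount /data2

  # Same export, same client, forced down to NFSv3
  mount -t nfs -o vers=3 server:/data2 /data2
  sync; time (tar -xf linux-4.14.15.tar.gz -C /data2/tmp; sync)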
NFSv3 async: wireshark trace
No. Time Source Destination Protocol Length Info
18527 2.815884979 192.168.202.2 192.168.202.1 NFS 216 V3 CREATE Call (Reply In 18528), DH: 0x62f39428/dma.h Mode: EXCLUSIVE
18528 2.816362338 192.168.202.1 192.168.202.2 NFS 328 V3 CREATE Reply (Call In 18527)
18529 2.816418841 192.168.202.2 192.168.202.1 NFS 224 V3 SETATTR Call (Reply In 18530), FH: 0x13678ba0
18530 2.816871820 192.168.202.1 192.168.202.2 NFS 216 V3 SETATTR Reply (Call In 18529)
18531 2.816966771 192.168.202.2 192.168.202.1 NFS 1148 V3 WRITE Call (Reply In 18532), FH: 0x13678ba0 Offset: 0 Len: 934 FILE_SYNC
18532 2.817441291 192.168.202.1 192.168.202.2 NFS 208 V3 WRITE Reply (Call In 18531) Len: 934 FILE_SYNC
18533 2.817495775 192.168.202.2 192.168.202.1 NFS 236 V3 SETATTR Call (Reply In 18534), FH: 0x13678ba0
18534 2.817920346 192.168.202.1 192.168.202.2 NFS 216 V3 SETATTR Reply (Call In 18533)
18535 2.818002910 192.168.202.2 192.168.202.1 NFS 216 V3 CREATE Call (Reply In 18536), DH: 0x62f39428/elf.h Mode: EXCLUSIVE
18536 2.818492126 192.168.202.1 192.168.202.2 NFS 328 V3 CREATE Reply (Call In 18535)
This is taking about 2ms for a small file write, rather than 3ms for NFSv4; the extra GETATTR and CLOSE RPCs in NFSv4 account for the difference.
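As a rough sanity check on those per-file latencies: a 4.14 kernel tree unpacks to roughly 60,000 files (approximate figure), so at ~2ms per file NFSv3 should take around 120 seconds and at ~3ms per file NFSv4 around 180 seconds, which is in the right ballpark for the 2m26s and 3m33s measured above. The untar is almost entirely bound by per-file round trips rather than by data transfer.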
So where I am:
- NFS in sync mode, at least on my two Fedora27 systems for my usage, is completely unusable (sync: 2 hours, async: 3 minutes, local disk: 13 seconds).
- NFS async mode is working, but small file writes are still very slow.
- NFS in async mode is 30% better with NFSv3 than NFSv4 when writing small files, due to the increased latency caused by NFSv4's two extra RPC calls.
I really think that in 2018 we should be able to have better NFS performance when writing many small files, such as in software development. This would dramatically speed up any system using NFS for this sort of workload, and reduce power usage, all for some improvements in the NFS protocol.
I don't know the details of whether this would work, or who is responsible for NFS, but it would be good, if possible, to have some improvements (NFSv4.3 ?). Maybe:
- Have an OPEN-SETATTR-WRITE RPC call all in one, and a SETATTR-CLOSE call all in one. This would reduce the latency of a small file to 1ms rather than 3ms, thus 66% faster. It would require the client to delay the OPEN/SETATTR until the first WRITE; not sure how possible this is in the implementations. Maybe READs could be improved as well, but getting the OPEN through quickly may be better in this case ?
- Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call (0.5ms vs 3ms).
- On sync/async modes, personally I think it would be better for the client to request sync or async at mount time. Setting sync on the server side would just enforce sync mode for all clients. If the server is in the default async mode, clients can mount using sync or async according to their requirements. This seems to match normal VFS semantics and usage patterns better.
- The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms). Maybe this is worth investigating in the Linux kernel processing (how ?).
- The 20ms RPC latency I see in sync mode needs looking at on my system, although async mode is fine for my usage. Maybe this ends up as 2 x 10ms drive seeks on ext4 and is thus expected.
Yet another poor NFSv3 performance issue. If I do a "ls -lR" of a certain NFS mounted directory over a slow link (NFS over Openvpn over FTTP 80/20Mbps), just after mounting the file system (default NFSv4 mount with async), it takes about 9 seconds. If I run the same "ls -lR" again, just after, it takes about 60 seconds.
A wireshark trace might help.
Also, is it possible some process is writing while this is happening?
--b.
So much for caching ! I have noticed Makefile based builds (over Ethernet 1Gbps) taking a long time with a second or so between each directory, I think this maybe why.
Listing the directory using a NFSv3 mount takes 67 seconds on the first mount and about the same on subsequent ones. No noticeable caching (default mount options with async), At least NFSv4 is fast the first time !
NFSv4 directory reads after mount:
No. Time Source Destination Protocol Length Info
667 4.560833210 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 672) READDIR FH: 0xde55a546
668 4.582809439 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=304477 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
669 4.582986377 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=305814 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
670 4.583003805 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=45901 Ack=307151 Win=1444 Len=0 TSval=913651376 TSecr=2646321616
671 4.583265423 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=307151 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
672 4.583280603 192.168.201.1 192.168.202.2 NFS 289 V4 Reply (Call In 667) READDIR
673 4.583291818 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=45901 Ack=308709 Win=1444 Len=0 TSval=913651377 TSecr=2646321616
674 4.583819172 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 675) GETATTR FH: 0xb91bfde7
675 4.605389953 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 674) GETATTR
676 4.605491075 192.168.202.2 192.168.201.1 NFS 288 V4 Call (Reply In 677) ACCESS FH: 0xb91bfde7, [Check: RD LU MD XT DL]
677 4.626848306 192.168.201.1 192.168.202.2 NFS 240 V4 Reply (Call In 676) ACCESS, [Allowed: RD LU MD XT DL]
678 4.626993773 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 679) READDIR FH: 0xb91bfde7
679 4.649330354 192.168.201.1 192.168.202.2 NFS 2408 V4 Reply (Call In 678) READDIR
680 4.649380840 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=46569 Ack=311465 Win=1444 Len=0 TSval=913651443 TSecr=2646321683
681 4.649716746 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 682) GETATTR FH: 0xb6d01f2a
682 4.671167708 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 681) GETATTR
683 4.671281003 192.168.202.2 192.168.201.1 NFS 288 V4 Call (Reply In 684) ACCESS FH: 0xb6d01f2a, [Check: RD LU MD XT DL]
684 4.692647455 192.168.201.1 192.168.202.2 NFS 240 V4 Reply (Call In 683) ACCESS, [Allowed: RD LU MD XT DL]
685 4.692825251 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 690) READDIR FH: 0xb6d01f2a
686 4.715060586 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=311881 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
687 4.715199557 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=313218 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
688 4.715215055 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=47237 Ack=314555 Win=1444 Len=0 TSval=913651509 TSecr=2646321748
689 4.715524465 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=314555 Ack=47237 Win=1452 Len=1337 TSval=2646321749 TSecr=913651486 [TCP segment of a reassembled PDU]
690 4.715911571 192.168.201.1 192.168.202.2 NFS 1449 V4 Reply (Call In 685) READDIR
NFS directory reads later:
No. Time Source Destination Protocol Length Info
664 9.485593049 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 669) READDIR FH: 0x1933e99e
665 9.507596250 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=127921 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
666 9.507717425 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=129258 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
667 9.507733352 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=130595 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
668 9.507987020 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=130595 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
669 9.508456847 192.168.201.1 192.168.202.2 NFS 989 V4 Reply (Call In 664) READDIR
670 9.508472149 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=132853 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
671 9.508880627 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 672) GETATTR FH: 0x7e9e8300
672 9.530375865 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 671) GETATTR
673 9.530564317 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 674) GETATTR FH: 0xcb837ac9
674 9.551906321 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 673) GETATTR
675 9.552064038 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 676) GETATTR FH: 0xbf951d32
676 9.574210528 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 675) GETATTR
677 9.574334117 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 678) GETATTR FH: 0xd3f3dc3e
678 9.595902902 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 677) GETATTR
679 9.596025484 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 680) GETATTR FH: 0xf534332a
680 9.617497794 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 679) GETATTR
681 9.617621218 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 682) GETATTR FH: 0xa7e5bbc5
682 9.639157371 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 681) GETATTR
683 9.639279098 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 684) GETATTR FH: 0xa8050515
684 9.660669335 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 683) GETATTR
685 9.660787725 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 686) READDIR FH: 0x7e9e8300
686 9.682612756 192.168.201.1 192.168.202.2 NFS 1472 V4 Reply (Call In 685) READDIR
687 9.682646761 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=67450 Ack=135965 Win=1444 Len=0 TSval=913106513 TSecr=2645776747
688 9.682906293 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 689) GETATTR FH: 0xa8050515
Lots of GETATTR calls the second time around (each file ?).
Really NFS is really broken performance wise these days and it "appears" that significant/huge improvements are possible.
Anyone know what group/who is responsible for NFS protocol these days ?
Also what group/who is responsible for the Linux kernel's implementation of it ?
On 05/02/18 14:52, J. Bruce Fields wrote:
Yet another poor NFSv3 performance issue. If I do a "ls -lR" of a certain NFS mounted directory over a slow link (NFS over Openvpn over FTTP 80/20Mbps), just after mounting the file system (default NFSv4 mount with async), it takes about 9 seconds. If I run the same "ls -lR" again, just after, it takes about 60 seconds.
A wireshark trace might help.
Also, is it possible some process is writing while this is happening?
--b.
Ok, I have made some wireshark traces and put these at:
https://www.beam.ltd.uk/files/files//nfs/
There are other processes running, obviously, but nothing that should be doing anything that would really affect this.
As a naive input, it looks like the client is using a cache but checking the update times of each file individually using GETATTR. As it is using a simple GETATTR per file in each directory, the latency of these RPC calls mounts up. I guess it would be possible to check the cache status of all files in a dir at once with one call, which would allow this to be faster when a full readdir is in progress, like a "GETATTR_DIR <dir>" RPC call. The overhead of the extra data would probably not affect the cache-check time for a single file, as latency rather than the amount of data is the killer.
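One cheap way to confirm the one-GETATTR-per-file pattern without a full capture is to diff the client's NFSv4 operation counters around the second run (nfsstat is from nfs-utils; the mount point here is hypothetical):

  nfsstat -c -4 > /tmp/ops.before
  ls -lR /mnt/nfsdir > /dev/null
  nfsstat -c -4 > /tmp/ops.after
  diff /tmp/ops.before /tmp/ops.after   # the getattr delta should be close to the number of files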
So much for caching ! I have noticed Makefile based builds (over Ethernet 1Gbps) taking a long time with a second or so between each directory, I think this maybe why.
Listing the directory using a NFSv3 mount takes 67 seconds on the first mount and about the same on subsequent ones. No noticeable caching (default mount options with async), At least NFSv4 is fast the first time !
NFSv4 directory reads after mount:
No. Time Source Destination Protocol Length Info
667 4.560833210 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 672) READDIR FH: 0xde55a546
668 4.582809439 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=304477 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
669 4.582986377 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=305814 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
670 4.583003805 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=45901 Ack=307151 Win=1444 Len=0 TSval=913651376 TSecr=2646321616
671 4.583265423 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=307151 Ack=45901 Win=1452 Len=1337 TSval=2646321616 TSecr=913651354 [TCP segment of a reassembled PDU]
672 4.583280603 192.168.201.1 192.168.202.2 NFS 289 V4 Reply (Call In 667) READDIR
673 4.583291818 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=45901 Ack=308709 Win=1444 Len=0 TSval=913651377 TSecr=2646321616
674 4.583819172 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 675) GETATTR FH: 0xb91bfde7
675 4.605389953 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 674) GETATTR
676 4.605491075 192.168.202.2 192.168.201.1 NFS 288 V4 Call (Reply In 677) ACCESS FH: 0xb91bfde7, [Check: RD LU MD XT DL]
677 4.626848306 192.168.201.1 192.168.202.2 NFS 240 V4 Reply (Call In 676) ACCESS, [Allowed: RD LU MD XT DL]
678 4.626993773 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 679) READDIR FH: 0xb91bfde7
679 4.649330354 192.168.201.1 192.168.202.2 NFS 2408 V4 Reply (Call In 678) READDIR
680 4.649380840 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=46569 Ack=311465 Win=1444 Len=0 TSval=913651443 TSecr=2646321683
681 4.649716746 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 682) GETATTR FH: 0xb6d01f2a
682 4.671167708 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 681) GETATTR
683 4.671281003 192.168.202.2 192.168.201.1 NFS 288 V4 Call (Reply In 684) ACCESS FH: 0xb6d01f2a, [Check: RD LU MD XT DL]
684 4.692647455 192.168.201.1 192.168.202.2 NFS 240 V4 Reply (Call In 683) ACCESS, [Allowed: RD LU MD XT DL]
685 4.692825251 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 690) READDIR FH: 0xb6d01f2a
686 4.715060586 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=311881 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
687 4.715199557 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=313218 Ack=47237 Win=1452 Len=1337 TSval=2646321748 TSecr=913651486 [TCP segment of a reassembled PDU]
688 4.715215055 192.168.202.2 192.168.201.1 TCP 68 679 → 2049 [ACK] Seq=47237 Ack=314555 Win=1444 Len=0 TSval=913651509 TSecr=2646321748
689 4.715524465 192.168.201.1 192.168.202.2 TCP 1405 2049 → 679 [ACK] Seq=314555 Ack=47237 Win=1452 Len=1337 TSval=2646321749 TSecr=913651486 [TCP segment of a reassembled PDU]
690 4.715911571 192.168.201.1 192.168.202.2 NFS 1449 V4 Reply (Call In 685) READDIR
NFS directory reads later:
No. Time Source Destination Protocol Length Info
664 9.485593049 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 669) READDIR FH: 0x1933e99e
665 9.507596250 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=127921 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
666 9.507717425 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=129258 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
667 9.507733352 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=130595 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
668 9.507987020 192.168.201.1 192.168.202.2 TCP 1405 2049 → 788 [ACK] Seq=130595 Ack=65730 Win=3076 Len=1337 TSval=2645776572 TSecr=913106316 [TCP segment of a reassembled PDU]
669 9.508456847 192.168.201.1 192.168.202.2 NFS 989 V4 Reply (Call In 664) READDIR
670 9.508472149 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=65730 Ack=132853 Win=1444 Len=0 TSval=913106338 TSecr=2645776572
671 9.508880627 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 672) GETATTR FH: 0x7e9e8300
672 9.530375865 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 671) GETATTR
673 9.530564317 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 674) GETATTR FH: 0xcb837ac9
674 9.551906321 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 673) GETATTR
675 9.552064038 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 676) GETATTR FH: 0xbf951d32
676 9.574210528 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 675) GETATTR
677 9.574334117 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 678) GETATTR FH: 0xd3f3dc3e
678 9.595902902 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 677) GETATTR
679 9.596025484 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 680) GETATTR FH: 0xf534332a
680 9.617497794 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 679) GETATTR
681 9.617621218 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 682) GETATTR FH: 0xa7e5bbc5
682 9.639157371 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 681) GETATTR
683 9.639279098 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 684) GETATTR FH: 0xa8050515
684 9.660669335 192.168.201.1 192.168.202.2 NFS 312 V4 Reply (Call In 683) GETATTR
685 9.660787725 192.168.202.2 192.168.201.1 NFS 304 V4 Call (Reply In 686) READDIR FH: 0x7e9e8300
686 9.682612756 192.168.201.1 192.168.202.2 NFS 1472 V4 Reply (Call In 685) READDIR
687 9.682646761 192.168.202.2 192.168.201.1 TCP 68 788 → 2049 [ACK] Seq=67450 Ack=135965 Win=1444 Len=0 TSval=913106513 TSecr=2645776747
688 9.682906293 192.168.202.2 192.168.201.1 NFS 280 V4 Call (Reply In 689) GETATTR FH: 0xa8050515
Lots of GETATTR calls the second time around (each file ?).
Really NFS is really broken performance wise these days and it "appears" that significant/huge improvements are possible.
Anyone know what group/who is responsible for NFS protocol these days ?
Also what group/who is responsible for the Linux kernel's implementation of it ?
On Thu, Feb 01, 2018 at 08:29:49AM +0000, Terry Barnaby wrote:
- Have an OPEN-SETATTR-WRITE RPC call all in one and a SETATTR-CLOSE call
all in one. This would reduce the latency of a small file to 1ms rather than 3ms thus 66% faster. Would require the client to delay the OPEN/SETATTR until the first WRITE. Not sure how possible this is in the implementations. Maybe READ's could be improved as well but getting the OPEN through quick may be better in this case ?
- Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call. (0.5ms vs
3ms).
The protocol doesn't currently let us delay the OPEN like that, unfortunately.
What we can do that might help: we can grant a write delegation in the reply to the OPEN. In theory that should allow the following operations to be performed asynchronously, so the untar can immediately issue the next OPEN without waiting. (In practice I'm not sure what the current client will do.)
I'm expecting to get to write delegations this year....
It probably wouldn't be hard to hack the server to return write delegations even when that's not necessarily correct, just to get an idea what kind of speedup is available here.
- On sync/async modes personally I think it would be better for the client
to request the mount in sync/async mode. The setting of sync on the server side would just enforce sync mode for all clients. If the server is in the default async mode clients can mount using sync or async as to their requirements. This seems to match normal VFS semantics and usage patterns better.
The client-side and server-side options are both named "sync", but they aren't really related. The server-side "async" export option causes the server to lie to clients, telling them that data has reached disk even when it hasn't. This affects all clients, whether they mounted with "sync" or "async". It violates the NFS specs, so it is not the default.
I don't understand your proposal. It sounds like you believe that mounting on the client side with the "sync" option will make your data safe even if the "async" option is set on the server side? Unfortunately that's not how it works.
- The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms) . Maybe this
is worth investigating in the Linux kernel processing (how ?) ?
Yes, that'd be interesting to investigate. With some kernel tracing I think it should be possible to get high-resolution timings for the processing of a single RPC call, which would make a good start.
It'd probably also be interesting to start with the simplest possible RPC and then work our way up and see when the RTT increases the most--e.g. does an RPC ping (an RPC with procedure 0, empty argument and reply) already have a round-trip time closer to .5ms or .12ms?
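A crude user-space approximation of that RPC ping is rpcinfo, which sends a NULL call (procedure 0) to the given program and version; this assumes the server still registers NFS with rpcbind, and each invocation also pays process start-up and a portmap lookup, so it only gives an upper bound to compare against ICMP (hypothetical host name):

  time for i in $(seq 1 100); do rpcinfo -T tcp server 100003 3 > /dev/null; done
  ping -c 100 -q server    # ICMP baseline over the same path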
- The 20ms RPC latency I see in sync mode needs a look at on my system
although async mode is fine for my usage. Maybe this ends up as 2 x 10ms drive seeks on ext4 and is thus expected.
Yes, this is why dedicated file servers have hardware designed to lower that latency.
As long as you're exporting with "async" and don't care about data safety across crashes or power outages, I guess you could go all the way and mount your ext4 export with "nobarrier", I *think* that will let the system acknowledge writes as soon as they reach the disk's write cache. I don't recommend that.
Just for fun I dug around a little for cheap options to get safe low-latency storage:
For Intel you can cross-reference this list:
https://ark.intel.com/Search/FeatureFilter?productType=solidstatedrives&...
of SSD's with "enhanced power loss data protection" (EPLDP) with shopping sites and I find e.g. this for US $121:
https://www.newegg.com/Product/Product.aspx?Item=9SIABVR66R5680
See the "device=" option in the ext4 man pages--you can use that to give your existing ext4 filesystem an external journal on that device. I think you want "data=journal" as well, then writes should normally be acknowledged once they hit that SSD's write cache, which should be quite quick.
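A minimal sketch of that external-journal setup, with hypothetical device names (the filesystem must be unmounted while its journal is removed and re-added, so do this offline and with backups):

  mke2fs -O journal_dev /dev/sdb1            # format the SSD partition as an ext4 journal device
  tune2fs -O ^has_journal /dev/sda1          # remove the existing internal journal
  tune2fs -J device=/dev/sdb1 /dev/sda1      # attach the external journal on the SSD
  mount -o data=journal /dev/sda1 /export    # journal data as well as metadata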
I was also curious whether there were PCI SSDs, but the cheapest Intel SSD with EPLDP is the P4800X, at US $1600.
Intel Optane Memory is interesting as it starts at $70. It doesn't have EPLDP but latency of the underlying storage might be better even without that?
I haven't figured out how to get a similar list for other brands.
Just searching for "SSD power loss protection" on newegg:
This also claims "power loss protection" at $53, but I can't find any reviews:
https://www.newegg.com/Product/Product.aspx?Item=9SIA1K642V2376&cm_re=ss...
Or this?:
https://www.newegg.com/Product/Product.aspx?Item=N82E16820156153&cm_re=s...
This is another interesting discussion of the problem:
https://blogs.technet.microsoft.com/filecab/2016/11/18/dont-do-it-consumer-s...
--b.
On 05/02/18 23:06, J. Bruce Fields wrote:
On Thu, Feb 01, 2018 at 08:29:49AM +0000, Terry Barnaby wrote:
- Have an OPEN-SETATTR-WRITE RPC call all in one and a SETATTR-CLOSE call
all in one. This would reduce the latency of a small file to 1ms rather than 3ms thus 66% faster. Would require the client to delay the OPEN/SETATTR until the first WRITE. Not sure how possible this is in the implementations. Maybe READ's could be improved as well but getting the OPEN through quick may be better in this case ?
- Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call. (0.5ms vs
3ms).
The protocol doesn't currently let us delay the OPEN like that, unfortunately.
Yes, should have thought of that, too focused on network traces and not thinking about the program/OS API :) But maybe OPEN-SETATTR and SETATTR-CLOSE would be possible.
What we can do that might help: we can grant a write delegation in the reply to the OPEN. In theory that should allow the following operations to be performed asynchronously, so the untar can immediately issue the next OPEN without waiting. (In practice I'm not sure what the current client will do.)
I'm expecting to get to write delegations this year....
It probably wouldn't be hard to hack the server to return write delegations even when that's not necessarily correct, just to get an idea what kind of speedup is available here.
That sounds good. I will have to read up on NFS write delegations, not sure how they work. I guess write() errors would be returned later than they actually occurred etc. ?
- On sync/async modes personally I think it would be better for the client
to request the mount in sync/async mode. The setting of sync on the server side would just enforce sync mode for all clients. If the server is in the default async mode clients can mount using sync or async as to their requirements. This seems to match normal VFS semantics and usage patterns better.
The client-side and server-side options are both named "sync", but they aren't really related. The server-side "async" export option causes the server to lie to clients, telling them that data has reached disk even when it hasn't. This affects all clients, whether they mounted with "sync" or "async". It violates the NFS specs, so it is not the default.
I don't understand your proposal. It sounds like you believe that mounting on the client side with the "sync" option will make your data safe even if the "async" option is set on the server side? Unfortunately that's not how it works.
Well, when a program running on a system calls open(), write() etc. on a local disk FS, the disk's contents are not actually updated. The data sits in system buffers until the next sync/fsync or until some time has passed. So, in your parlance, the OS write() call lies to the program. So it is by default async unless the "sync" mount option is used when mounting the particular file system in question.
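For reference, the "until some time has passed" part for local filesystems is governed by the kernel's writeback knobs; these are standard sysctls, shown here only to illustrate the default write-behind behaviour:

  sysctl vm.dirty_writeback_centisecs               # how often the flusher threads wake up
  sysctl vm.dirty_expire_centisecs                  # how old dirty data may get before it must be written out
  sysctl vm.dirty_background_ratio vm.dirty_ratio   # dirty-memory thresholds that trigger/force writeback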
Although it is different from the current NFS settings methods, I would have thought that this should be the same for NFS. So if a client mounts a file system normally it is async, i.e. write() data is in buffers somewhere (client or server), unless the client mounts the file system in sync mode. The only difference from the normal FS conventions I am suggesting is to allow the server to stipulate "sync" on its export, which would force sync mode for all clients on that FS. I know it is different from standard NFS config but it just seems more logical to me :) The sync/async option and its ramifications are really dependent on the client's usage in most cases.
In the case of a /home mount for example, or a source code build file system, it is normally only one client that is accessing the dir, and if a write fails due to the server going down (an unlikely occurrence), it's not much of an issue. I have only had this happen a couple of times in 28 years, and then with no significant issues (power outage, disk failure pre-RAID etc.).
I know that is not how NFS currently "works", it just seems illogical to me the way it currently does work :)
- The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms) . Maybe this
is worth investigating in the Linux kernel processing (how ?) ?
Yes, that'd be interesting to investigate. With some kernel tracing I think it should be possible to get high-resolution timings for the processing of a single RPC call, which would make a good start.
It'd probably also interesting to start with the simplest possible RPC and then work our way up and see when the RTT increases the most--e.g does an RPC ping (an RPC with procedure 0, empty argument and reply) already have a round-trip time closer to .5ms or .12ms?
Any pointers to trying this ? I have a small amount of time as work is quiet at the moment.
- The 20ms RPC latency I see in sync mode needs a look at on my system
although async mode is fine for my usage. Maybe this ends up as 2 x 10ms drive seeks on ext4 and is thus expected.
Yes, this is why dedicated file servers have hardware designed to lower that latency.
As long as you're exporting with "async" and don't care about data safety across crashes or power outages, I guess you could go all the way and mount your ext4 export with "nobarrier", I *think* that will let the system acknowledge writes as soon as they reach the disk's write cache. I don't recommend that.
Just for fun I dug around a little for cheap options to get safe low-latency storage:
For Intel you can cross-reference this list:
https://ark.intel.com/Search/FeatureFilter?productType=solidstatedrives&...
of SSD's with "enhanced power loss data protection" (EPLDP) with shopping sites and I find e.g. this for US $121:
https://www.newegg.com/Product/Product.aspx?Item=9SIABVR66R5680
See the "device=" option in the ext4 man pages--you can use that to give your existing ext4 filesystem an external journal on that device. I think you want "data=journal" as well, then writes should normally be acknowledged once they hit that SSD's write cache, which should be quite quick.
I was also curious whether there were PCI SSDs, but the cheapest Intel SSD with EPLDP is the P4800X, at US $1600.
Intel Optane Memory is interesting as it starts at $70. It doesn't have EPLDP but latency of the underlying storage might be better even without that?
I haven't figured out how to get a similar list for other brands.
Just searching for "SSD power loss protection" on newegg:
This also claims "power loss protection" at $53, but I can't find any reviews:
https://www.newegg.com/Product/Product.aspx?Item=9SIA1K642V2376&cm_re=ss...
Or this?:
https://www.newegg.com/Product/Product.aspx?Item=N82E16820156153&cm_re=s...
This is another interesting discussion of the problem:
https://blogs.technet.microsoft.com/filecab/2016/11/18/dont-do-it-consumer-s...
--b.
We have also found that SSDs, or at least NAND flash, have quite a few write latency peculiarities. We use eMMC NAND flash on a few embedded systems we have designed and the write latency patterns are a bit random and not well described/defined in datasheets etc. Difficult when you have an embedded system with small amounts of RAM doing real-time data capture !
Although using a low latency SSD drive could speed up NFS sync performance, I don't think it would affect NFS async write performance that much (already 50 - 100 x slower than normal HD access). It is the latency and the way the protocol works that is causing the most issue. Changing the NFS file system protocol/performance has much more scope for improvements with async mode and async mode I think is fine for most usage.
On Tue, Feb 06, 2018 at 08:18:27PM +0000, Terry Barnaby wrote:
Well, when a program running on a system calls open(), write() etc. to the local disk FS the disk's contents is not actually updated. The data is in server buffers until the next sync/fsync or some time has passed. So, in your parlance, the OS write() call lies to the program. So it is by default async unless the "sync" mount option is used when mounting the particular file system in question.
That's right, but note applications are written with the knowledge that OS's behave this way, and are given tools (sync, fsync, etc.) to manage this behavior so that they still have some control over what survives a crash.
(But sync & friends no longer do what they're supposed to on a Linux server exporting with async.)
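A quick way to see the difference from a client is to time an explicitly flushed write against the same export set first to "sync" and then to "async" (hypothetical mount point; dd's conv=fsync makes it call fsync() before exiting, so with a sync export the command should not complete until the data is on the server's disk, while with an async export it returns as soon as the server has buffered it):

  dd if=/dev/zero of=/mnt/nfs/testfile bs=4k count=1000 conv=fsync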
Although it is different from the current NFS settings methods, I would have thought that this should be the same for NFS. So if a client mounts a file system normally it is async, ie write() data is in buffers somewhere (client or server) unless the client mounts the file system in sync mode.
In fact, this is pretty much how it works, for write().
It didn't used to be that way--NFSv2 writes were all synchronous.
The problem is that if a server power cycles while it still had dirty data in its caches, what should you do?
You can't ignore it--you'd just be silently losing data. You could return an error at some point, but "we just lost some of your data, no idea what" isn't an error an application can really act on.
So NFSv3 introduced a separation of write into WRITE and COMMIT. The client first sends a WRITE with the data, then later sends a COMMIT call that says "please don't return till that data I sent before is actually on disk".
If the server reboots, there's a limited set of data that the client needs to resend to recover (just data that's been written but not committed.)
But we only have that for file data, metadata would be more complicated, so stuff like file creates, setattr, directory operations, etc., are still synchronous.
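The WRITE/COMMIT split is easy to observe from the client's cumulative NFSv3 counters while writing one large file (hypothetical mount point; nfsstat is from nfs-utils):

  nfsstat -c -3 > /tmp/before
  dd if=/dev/zero of=/mnt/nfs/bigfile bs=1M count=1000
  sync                                  # forces the outstanding unstable writes to be committed
  nfsstat -c -3 > /tmp/after
  diff /tmp/before /tmp/after           # expect many write calls but only a handful of commits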
Only difference from the normal FS conventions I am suggesting is to allow the server to stipulate "sync" on its mount that forces sync mode for all clients on that FS.
Anyway, we don't have protocol to tell clients to do that.
In the case of a /home mount for example, or a source code build file system, it is normally only one client that is accessing the dir and if a write fails due to the server going down (an unlikely occurrence, its not much of an issue. I have only had this happen a couple of times in 28 years and then with no significant issues (power outage, disk fail pre-raid etc.).
So if you have reliable servers and power, maybe you're comfortable with the risk. There's a reason that's not the default, though.
- The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms) . Maybe this
is worth investigating in the Linux kernel processing (how ?) ?
Yes, that'd be interesting to investigate. With some kernel tracing I think it should be possible to get high-resolution timings for the processing of a single RPC call, which would make a good start.
It'd probably also interesting to start with the simplest possible RPC and then work our way up and see when the RTT increases the most--e.g does an RPC ping (an RPC with procedure 0, empty argument and reply) already have a round-trip time closer to .5ms or .12ms?
Any pointers to trying this ? I have a small amount of time as work is quiet at the moment.
Hm. I wonder if testing over loopback would give interesting enough results. That might simplify testing even if it's not as realistic. You could start by seeing if latency is still similar.
You could start by googling around for "ftrace", I think lwn.net's articles were pretty good introductions.
I don't do this very often and don't have good step-by-step instructions....
I believe the simplest way to do it was using "trace-cmd" (which is packaged for fedora in a package of the same name). The man page looks skimpy, but https://lwn.net/Articles/410200/ looks good. Maybe run it while just stat-ing a single file on an NFS partition as a start.
I don't know if that will result in too much data. Figuring out how to filter it may be tricky. Tracing everything may be prohibitive. Several processes are involved so you don't want to restrict by process. Maybe restricting to functions in nfsd and sunrpc modules would work, with something like -l ':mod:nfs' -l ':mod:sunrpc'.
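Something along these lines might be a starting point (module names differ by side: nfs on the client, nfsd on the server, plus sunrpc on both; the path is hypothetical and the exact trace-cmd options may need adjusting):

  trace-cmd record -p function_graph -l ':mod:nfs' -l ':mod:sunrpc' stat /mnt/nfs/somefile
  trace-cmd report | less     # timestamps in the report give per-function timings along the RPC path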
We have also found that SSD's or at least NAND flash has quite a few write latency peculiarities . We use eMMC NAND flash on a few embedded systems we have designed and the write latency patterns are a bit random and not well described/defined in datasheets etc. Difficult when you have an embedded system with small amounts of RAM doing real-time data capture !
That's one of the reasons you want the "enterprise" drives with power loss protection--they let you just write to cache, so write-behind and gathering of writes into erase-block-sized writes to flash should allow the firmware to hide weird flash latency from you.
A few years ago (and I have poor notes, so take this with a grain of salt) I tested the same untar-a-kernel-workload using an external journal on an SSD without that feature, and found it didn't offer any improvement.
Although using a low latency SSD drive could speed up NFS sync performance, I don't think it would affect NFS async write performance that much (already 50 - 100 x slower than normal HD access).
You need to specify "file creates" here--over NFS that has very different performance characteristics. Ordinary file writes should still be able to saturate the network and/or disk in most cases.
If you want a protocol that makes no distinction between metadata and data, and if you *really* don't do any sharing between clients, then another option is to use a block protocol (iscsi or something). That will have different drawbacks.
It is the latency and the way the protocol works that is causing the most issue.
Sure. The protocol issues are probably more complicated than they first appear, though!
--b.
On 06/02/18 21:48, J. Bruce Fields wrote:
On Tue, Feb 06, 2018 at 08:18:27PM +0000, Terry Barnaby wrote:
Well, when a program running on a system calls open(), write() etc. to the local disk FS the disk's contents is not actually updated. The data is in server buffers until the next sync/fsync or some time has passed. So, in your parlance, the OS write() call lies to the program. So it is by default async unless the "sync" mount option is used when mounting the particular file system in question.
That's right, but note applications are written with the knowledge that OS's behave this way, and are given tools (sync, fsync, etc.) to manage this behavior so that they still have some control over what survives a crash.
(But sync & friends no longer do what they're supposed to on an Linux server exporting with async.)
Don't fsync() and perhaps sync() work across NFS then, when the server has an async export ? I thought they did, along with file locking to some extent.
Although it is different from the current NFS settings methods, I would have thought that this should be the same for NFS. So if a client mounts a file system normally it is async, ie write() data is in buffers somewhere (client or server) unless the client mounts the file system in sync mode.
In fact, this is pretty much how it works, for write().
It didn't used to be that way--NFSv2 writes were all synchronous.
The problem is that if a server power cycles while it still had dirty data in its caches, what should you do? You can't ignore it--you'd just be silently losing data. You could return an error at some point, but "we just lost some or your idea, no idea what" isn't an error an application can really act on.
Yes, it is tricky error handling. But what does a program do when its local hard disk or machine dies underneath it anyway ? I don't think a program on a remote system is particularly worse off if the NFS server dies; it may have to die if it can't do any special recovery. If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction based approach; indeed it shouldn't be using network remote disk mounts anyway. It all depends on what the program is doing and its usage requirements. A cc failing once in a blue moon is not a real issue (as long as it fails and removes its created files, or at least a make clean can be run). As I have said, I have used NFS async for about 27+ years on multiple systems with no problems when servers die, with the type of usage I use NFS for. The number of times a server has died is low in that time. Client systems have died many many more times (user issues, experimental programs/kernels, random program usage, single cheap disks, cheaper non ECC RAM etc.)
So NFSv3 introduced a separation of write into WRITE and COMMIT. The client first sends a WRITE with the data, then latter sends a COMMIT call that says "please don't return till that data I sent before is actually on disk".
If the server reboots, there's a limited set of data that the client needs to resend to recover (just data that's been written but not committed.)
But we only have that for file data, metadata would be more complicated, so stuff like file creates, setattr, directory operations, etc., are still synchronous.
Only difference from the normal FS conventions I am suggesting is to allow the server to stipulate "sync" on its mount that forces sync mode for all clients on that FS.
Anyway, we don't have protocol to tell clients to do that.
As I said NFSv4.3 :)
In the case of a /home mount for example, or a source code build file system, it is normally only one client that is accessing the dir and if a write fails due to the server going down (an unlikely occurrence, its not much of an issue. I have only had this happen a couple of times in 28 years and then with no significant issues (power outage, disk fail pre-raid etc.).
So if you have reliable servers and power, maybe you're comfortable with the risk. There's a reason that's not the default, though.
Well, it is the default for local FS mounts so I really don't see why it should be different for network mounts. But anyway, for my usage NFS sync is completely unusable (as would be local sync mounts), so it has to be async NFS or local disks (13 secs local disk -> 3 mins NFS async -> 2 hours NFS sync). I would have thought that would go for the majority of NFS usage. No issue to me though as long as async can be configured and works well :)
- The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms) . Maybe this
is worth investigating in the Linux kernel processing (how ?) ?
Yes, that'd be interesting to investigate. With some kernel tracing I think it should be possible to get high-resolution timings for the processing of a single RPC call, which would make a good start.
It'd probably also interesting to start with the simplest possible RPC and then work our way up and see when the RTT increases the most--e.g does an RPC ping (an RPC with procedure 0, empty argument and reply) already have a round-trip time closer to .5ms or .12ms?
Any pointers to trying this ? I have a small amount of time as work is quiet at the moment.
Hm. I wonder if testing over loopback would give interesting enough results. That might simplify testing even if it's not as realistic. You could start by seeing if latency is still similar.
You could start by googling around for "ftrace", I think lwn.net's articles were pretty good introductions.
I don't do this very often and don't have good step-by-step instructions....
I believe the simplest way to do it was using "trace-cmd" (which is packaged for fedora in a package of the same name). The man page looks skimpy, but https://lwn.net/Articles/410200/ looks good. Maybe run it while just stat-ing a single file on an NFS partition as a start.
I don't know if that will result in too much data. Figuring out how to filter it may be tricky. Tracing everything may be prohibitive. Several processes are involved so you don't want to restrict by process. Maybe restricting to functions in nfsd and sunrpc modules would work, with something like -l ':mod:nfs' -l ':mod:sunrpc'.
Thanks for the ideas, I will try and have a play.
We have also found that SSD's or at least NAND flash has quite a few write latency peculiarities . We use eMMC NAND flash on a few embedded systems we have designed and the write latency patterns are a bit random and not well described/defined in datasheets etc. Difficult when you have an embedded system with small amounts of RAM doing real-time data capture !
That's one of the reasons you want the "enterprise" drives with power loss protection--they let you just write to cache, so write-behind and gathering of writes into erase-block-sized writes to flash should allow the firmware to hide weird flash latency from you.
A few years ago (and I have poor notes, so take this with a grain of salt) I tested the same untar-a-kernel-workload using an external journal on an SSD without that feature, and found it didn't offer any improvement.
Although using a low latency SSD drive could speed up NFS sync performance, I don't think it would affect NFS async write performance that much (already 50 - 100 x slower than normal HD access).
You need to specify "file creates" here--over NFS that has very different performance characteristics. Ordinary file writes should still be able to saturate the network and/or disk in most cases.
If you want a protocol that makes no distinction between metadata and data, and if you *really* don't do any sharing between clients, then another option is to use a block protocol (iscsi or something). That will have different drawbacks.
It is the latency and the way the protocol works that is causing the most issue.
Sure. The protocol issues are probably more complicated than they first appear, though!
Yes, they probably are, most things are below the surface, but I still think there are likely to be a lot of improvements that could be made that would make using NFS async more tenable to the user. If necessary, local file caching (to local disk) with delayed NFS writes could be used. I do use fscache for the NFS - OpenVPN - FTTP mounts, but the NFS cache revalidation checks probably hit the performance of this for reads, and I presume writes would be write-through rather than delayed write. I haven't actually looked at the performance of this, and I know there are other network file systems that may be more suited in that case.
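For what it's worth, the fscache setup referred to here is just the cachefilesd daemon plus the "fsc" mount option (cachefilesd is its own package; the server and mount point names below are hypothetical). Note that fscache only caches read data, so writes are not delayed by it:

  systemctl enable --now cachefilesd                       # user-space cache manager backing fscache
  mount -t nfs -o vers=4.2,fsc server:/export /mnt/export  # "fsc" opts this NFS mount into fscache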
--b.
On Thu, Feb 08, 2018 at 08:21:44PM +0000, Terry Barnaby wrote:
Doesn't fsync() and perhaps sync() work across NFS then when the server has an async export,
No.
On a local filesystem, a file create followed by a sync will ensure the file create reaches disk. Normally on NFS, the same is true--for the trivial reason that the file create already ensured this. If your server is Linux knfsd exporting the filesystem with async, the file create may still not be on disk after the sync.
I don't think a program on a remote system is particularly worse off if the NFS server dies, it may have to die if it can't do any special recovery.
Well-written applications should be able to deal with recovering after a crash, *if* the filesystem respects fsync() and friends. If the filesystem ignores them and loses data silently, the application is left in a rather more difficult position!
Only difference from the normal FS conventions I am suggesting is to allow the server to stipulate "sync" on its mount that forces sync mode for all clients on that FS.
Anyway, we don't have protocol to tell clients to do that.
As I said NFSv4.3 :)
Protocol extensions are certainly possible.
So if you have reliable servers and power, maybe you're comfortable with the risk. There's a reason that's not the default, though.
Well, it is the default for local FS mounts so I really don't see why it should be different for network mounts.
It's definitely not the default for local mounts to ignore sync(). So, you understand why I say that the "async" export option is very different from the mount option with the same name. (Yes, the name was a mistake.) And you can see why a filesystem engineer would get nervous about recommending that configuration.
But anyway for my usage NFS sync is completely unusable (as would local sync mounts) so it has to be async NFS or local disks (13 secs local disk -> 3mins NFS async-> 2 hours NFS sync). I would have thought that would go for the majority of NFS usage. No issue to me though as long as async can be configured and works well :)
So, instead what I personally use is a hardware configuration that allows me to get similar performance while still using the default export options.
Sure. The protocol issues are probably more complicated than they first appear, though!
Yes, they probably are, most things are below the surface, but I still think there are likely to be a lot of improvements that could be made that would make using NFS async more tenable to the user. If necessary local file caching (to local disk) with delayed NFS writes. I do use fscache for the NFS - OpenVPN - FTTP mounts, but the NFS caching time tests probably hit the performance of this for reads and I presume writes would be write through rather than delayed write. Haven't actually looked at the performance of this and I know there are other network file systems that may be more suited in that case.
fscache doesn't remove the need for synchronous file creates.
So, in the existing protocol write delegations are probably what would help most; which is why they're near the top of my todo list.
But write delegations just cover file data and attributes. If we want a client to be able to, for example, respond to creat() with success, we want write delegations on *directories*. That's rather more complicated, and we currently don't even have protocol proposed for that. It's been proposed in the past and I hope there may be sufficient time and motivation to make it happen some day....
--b.
----- Original Message ----- From: "Terry Barnaby"
If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction based app
??? Many people use NFS NAS because doing RAID+Backup on every client is too expensive. So yes, they *are* using NFS because it is important to get the data to disk.
Regards,
On 09/02/18 08:25, nicolas.mailhot@laposte.net wrote:
----- Original Message ----- From: "Terry Barnaby"
If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction based app
??? Many people use NFS NAS because doing RAID+Backup on every client is too expensive. So yes, they *are* using NFS because it is important to get the data to disk.
Regards,
Yes, that is why I said some people would be using "FS sync". These people would use the "sync" mount option (ideally this would be set on the NFS client, as the clients know they need this). Personally we use rsync, via an rsync server or over ssh, for backups like this, as NFS sync would be far too slow and rsync provides an easy incremental mode plus other benefits.
On Mon, Feb 12, 2018 at 09:08:47AM +0000, Terry Barnaby wrote:
On 09/02/18 08:25, nicolas.mailhot@laposte.net wrote:
----- Original Message ----- From: "Terry Barnaby"
If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction based app
??? Many people use NFS NAS because doing RAID+Backup on every client is too expensive. So yes, they *are* using NFS because it is important to get the data to disk.
Regards,
Yes, that is why I said some people would be using "FS sync". These people would use the sync option, but then they would use "sync" mount option, (ideally this would be set on the NFS client as the clients know they need this).
The "sync" mount option should not be necessary for data safety. Carefully written apps know how to use fsync() and related calls at points where they need data to be durable.
The server-side "async" export option, on the other hand, undermines exactly those calls and therefore can result in lost or corrupted data on a server crash, no matter how careful the application.
Again, we need to be very careful to distinguish between the client-side "sync" mount option and the server-side "sync" export option.
--b.
On 12/02/18 17:06, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 09:08:47AM +0000, Terry Barnaby wrote:
On 09/02/18 08:25, nicolas.mailhot@laposte.net wrote:
----- Original Message ----- From: "Terry Barnaby"
If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction based app
??? Many people use NFS NAS because doing RAID+Backup on every client is too expensive. So yes, they *are* using NFS because it is important to get the data to disk.
Regards,
Yes, that is why I said some people would be using "FS sync". These people would use the sync option, but then they would use "sync" mount option, (ideally this would be set on the NFS client as the clients know they need this).
The "sync" mount option should not be necessary for data safety. Carefully written apps know how to use fsync() and related calls at points where they need data to be durable.
The server-side "async" export option, on the other hand, undermines exactly those calls and therefore can result in lost or corrupted data on a server crash, no matter how careful the application.
Again, we need to be very careful to distinguish between the client-side "sync" mount option and the server-side "sync" export option.
--b.
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ? I would have thought this would still work correctly.
On Mon, Feb 12, 2018 at 05:09:32PM +0000, Terry Barnaby wrote:
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ?
No.
If a server sets "async" on an export, there is absolutely no way for a client to guarantee that data reaches disk, or to know when it happens.
Possibly "ignore_sync", or "unsafe_sync", or something else, would be a better name.
--b.
On 12/02/18 17:15, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 05:09:32PM +0000, Terry Barnaby wrote:
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ?
No.
If a server sets "async" on an export, there is absolutely no way for a client to guarantee that data reaches disk, or to know when it happens.
Possibly "ignore_sync", or "unsafe_sync", or something else, would be a better name.
--b.
Well that seems like a major drop off, I always thought that fsync() would work in this case. I don't understand why fsync() should not operate as intended ? Sounds like this NFS async thing needs some work !
I still do not understand why NFS doesn't operate in the same way as a standard mount on this. The use for async is only for improved performance due to disk write latency and speed (or are there other reasons ?)
So with a local system mount:
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Data/metadata is flushed to disk by fsync(), sync() and occasionally by the kernel. A process's data is not actually stored until fsync(), sync() etc.
sync (with the sync option): data/metadata is written to disk before the system calls return (all FS system calls ?).
With an NFS mount I would have thought it should be the same.
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Normally these would be on the server (so multiple clients can work with the same data), but with some options (for particular usage) client-side write buffering/caching could be used (i.e. data would not actually pass to the server during every FS system call). Data/metadata is flushed to the server's disk by fsync(), sync() and occasionally by the kernel (if client-side write caching is used, this flushes across the network and then flushes the server's buffers). A process's data is not actually stored until fsync(), sync() etc.
sync (with the client-side sync option): data/metadata is written across NFS and to the server's disk before the system calls return (all FS system calls ?).
I really don't understand why the async option is implemented on the server export, although a sync option there could force sync for all clients of that export. What am I missing ? Is there some good reason (rather than history) that it is done this way ?
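As an aside (illustration only, not from the thread): an individual application that needs synchronous behaviour can also request it per file with O_SYNC, independent of any mount-wide "sync" option, though a server-side "async" export undermines this just as it does fsync(). The file name here is made up.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_SYNC: each write() returns only once the data and required metadata
     * have been committed, as if just this file were on a "sync" mount. */
    int fd = open("/mnt/nfs/journal.dat", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    const char rec[] = "record\n";
    if (write(fd, rec, sizeof rec - 1) < 0)
        perror("write");
    close(fd);
    return 0;
}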
On 12/02/18 17:35, Terry Barnaby wrote:
On 12/02/18 17:15, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 05:09:32PM +0000, Terry Barnaby wrote:
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ?
No.
If a server sets "async" on an export, there is absolutely no way for a client to guarantee that data reaches disk, or to know when it happens.
Possibly "ignore_sync", or "unsafe_sync", or something else, would be a better name.
--b.
Well that seems like a major drop off, I always thought that fsync() would work in this case. I don't understand why fsync() should not operate as intended ? Sounds like this NFS async thing needs some work !
I still do not understand why NFS doesn't operate in the same way as a standard mount on this. The use for async is only for improved performance due to disk write latency and speed (or are there other reasons ?)
So with a local system mount:
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Data/metadata is flushed to disk by fsync(), sync() and occasionally by the kernel. A process's data is not actually stored until fsync(), sync() etc.
sync (with the sync option): data/metadata is written to disk before the system calls return (all FS system calls ?).
With an NFS mount I would have thought it should be the same.
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Normally these would be on the server (so multiple clients can work with the same data), but with some options (for particular usage) client-side write buffering/caching could be used (i.e. data would not actually pass to the server during every FS system call). Data/metadata is flushed to the server's disk by fsync(), sync() and occasionally by the kernel (if client-side write caching is used, this flushes across the network and then flushes the server's buffers). A process's data is not actually stored until fsync(), sync() etc.
sync (with the client-side sync option): data/metadata is written across NFS and to the server's disk before the system calls return (all FS system calls ?).
I really don't understand why the async option is implemented on the server export, although a sync option there could force sync for all clients of that export. What am I missing ? Is there some good reason (rather than history) that it is done this way ?
Just tried the use of fsync() with an NFS async mount, it appears to work. With a simple 'C' program as a test program I see the following data rates/times when the program writes 100 MBytes to a single file over NFS (open, write, write .., fsync) followed by close (after the timing):
NFS Write multiple small files 0.001584 ms/per file 0.615829 MBytes/sec CpuUsage: 3.2%
Disktest: Writing/Reading 100.00 MBytes in 1048576 Byte Chunks
Disk Write sequential data rate fsync: 1 107.250685 MBytes/sec CpuUsage: 13.4%
Disk Write sequential data rate fsync: 0 4758.953878 MBytes/sec CpuUsage: 66.7%
Without the fsync() call the data rate is obviously to buffers and with the fsync() call it definitely looks like it is to disk.
Interestingly, it appears that the close() call actually does an effective fsync() as well, as the close() takes an age when fsync() is not used.
(By the way, I just got bitten by a Fedora27 KDE/plasma/NetworkManager change that sets the Ethernet interfaces of all my systems to 100 MBits/s half duplex. Looks like the ability to configure Ethernet auto negotiation has been added and the default is fixed 100 MBits/s half duplex !)
Basic test code (just the write function):
void nfsPerfWrite(int doFsync){
    int f;
    char buf[bufSize];          /* write buffer (contents don't matter for this test) */
    int n;
    double st, et, r;
    int nb;
    CpuStat cpuStatStart;
    CpuStat cpuStatEnd;
    double cpuUsed;
    double cpuUsage;

    sync();
    f = open64(fileName, O_RDWR | O_CREAT, 0666);
    if(f < 0){
        fprintf(stderr, "Error creating %s: %s\n", fileName, strerror(errno));
        return;
    }

    sync();
    cpuStatGet(&cpuStatStart);
    st = getTime();
    for(n = 0; n < diskNum; n++){
        if((nb = write(f, buf, bufSize)) != bufSize)
            fprintf(stderr, "WriteError: %d\n", nb);
    }

    if(doFsync)
        fsync(f);               /* wait for the data to reach stable storage (in theory) */

    et = getTime();
    cpuStatGet(&cpuStatEnd);

    /* CPU usage over the timed interval */
    cpuStatEnd.user = cpuStatEnd.user - cpuStatStart.user;
    cpuStatEnd.nice = cpuStatEnd.nice - cpuStatStart.nice;
    cpuStatEnd.sys  = cpuStatEnd.sys  - cpuStatStart.sys;
    cpuStatEnd.idle = cpuStatEnd.idle - cpuStatStart.idle;
    cpuStatEnd.wait = cpuStatEnd.wait - cpuStatStart.wait;
    cpuStatEnd.hi   = cpuStatEnd.hi   - cpuStatStart.hi;
    cpuStatEnd.si   = cpuStatEnd.si   - cpuStatStart.si;

    cpuUsed  = (cpuStatEnd.user + cpuStatEnd.nice + cpuStatEnd.sys + cpuStatEnd.hi + cpuStatEnd.si);
    cpuUsage = cpuUsed / (cpuUsed + cpuStatEnd.idle);

    r = ((double)diskNum * bufSize) / (et - st);
    printf("Disk Write sequential data rate fsync: %d %f MBytes/sec CpuUsage: %.1f%%\n",
        doFsync, r / (1024*1024), cpuUsage * 100);
    close(f);
}
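For comparison, the "multiple small files" figure above comes from a different part of the test program; a sketch of the kind of create/write/close loop involved (not the author's actual code; the directory, file count and file size are made up) is:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_FILES 1000
#define FILE_SIZE 1024

int main(void)
{
    char buf[FILE_SIZE] = { 0 };   /* contents don't matter for the test */
    char name[256];
    int i;

    for (i = 0; i < NUM_FILES; i++) {
        snprintf(name, sizeof name, "/data/smalltest/file%04d", i);
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) {
            perror(name);
            return 1;
        }
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            perror("write");
        close(fd);                 /* on NFS this close() also flushes the writes to the server */
    }
    return 0;
}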
On Mon, Feb 12, 2018 at 08:12:58PM +0000, Terry Barnaby wrote:
On 12/02/18 17:35, Terry Barnaby wrote:
On 12/02/18 17:15, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 05:09:32PM +0000, Terry Barnaby wrote:
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ?
No.
If a server sets "async" on an export, there is absolutely no way for a client to guarantee that data reaches disk, or to know when it happens.
Possibly "ignore_sync", or "unsafe_sync", or something else, would be a better name.
...
Just tried the use of fsync() with an NFS async mount, it appears to work.
That's expected, it's the *export* option that cheats, not the mount option.
Also, even if you're using the async export option--fsync will still flush data to server memory, just not necessarily to disk.
With a simple 'C' program as a test program I see the following data rates/times when the program writes 100 MBytes to a single file over NFS (open, write, write .., fsync) followed by close (after the timing):
NFS Write multiple small files 0.001584 ms/per file 0.615829 MBytes/sec CpuUsage: 3.2%
Disktest: Writing/Reading 100.00 MBytes in 1048576 Byte Chunks
Disk Write sequential data rate fsync: 1 107.250685 MBytes/sec CpuUsage: 13.4%
Disk Write sequential data rate fsync: 0 4758.953878 MBytes/sec CpuUsage: 66.7%
Without the fsync() call the data rate is obviously to buffers and with the fsync() call it definitely looks like it is to disk.
Could be, or you could be network-limited, hard to tell without knowing more.
Interestingly, it appears that the close() call actually does an effective fsync() as well, as the close() takes an age when fsync() is not used.
Yes: http://nfs.sourceforge.net/#faq_a8
--b.
On 12/02/18 22:14, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 08:12:58PM +0000, Terry Barnaby wrote:
On 12/02/18 17:35, Terry Barnaby wrote:
On 12/02/18 17:15, J. Bruce Fields wrote:
On Mon, Feb 12, 2018 at 05:09:32PM +0000, Terry Barnaby wrote:
One thing on this, that I forgot to ask, doesn't fsync() work properly with an NFS server side async mount then ?
No.
If a server sets "async" on an export, there is absolutely no way for a client to guarantee that data reaches disk, or to know when it happens.
Possibly "ignore_sync", or "unsafe_sync", or something else, would be a better name.
...
Just tried the use of fsync() with an NFS async mount, it appears to work.
That's expected, it's the *export* option that cheats, not the mount option.
Also, even if you're using the async export option--fsync will still flush data to server memory, just not necessarily to disk.
With a simple 'C' program as a test program I see the following data rates/times when the program writes 100 MBytes to a single file over NFS (open, write, write .., fsync) followed by close (after the timing):
NFS Write multiple small files 0.001584 ms/per file 0.615829 MBytes/sec CpuUsage: 3.2%
Disktest: Writing/Reading 100.00 MBytes in 1048576 Byte Chunks
Disk Write sequential data rate fsync: 1 107.250685 MBytes/sec CpuUsage: 13.4%
Disk Write sequential data rate fsync: 0 4758.953878 MBytes/sec CpuUsage: 66.7%
Without the fsync() call the data rate is obviously to buffers and with the fsync() call it definitely looks like it is to disk.
Could be, or you could be network-limited, hard to tell without knowing more.
Interestingly, it appears that the close() call actually does an effective fsync() as well, as the close() takes an age when fsync() is not used.
Yes: http://nfs.sourceforge.net/#faq_a8
--b.
Quite right, it was network limited (disk vs network speed is about the same). Using a slower USB stick disk shows that fsync() is not working with an NFSv4 "async" export.
But why is this ? It just doesn't make sense to me that fsync() should work this way with an NFS "async" export. Why shouldn't it do the right thing and "synchronize a file's in-core state with the storage device" (I don't consider an NFS server a storage device, only the non-volatile devices it uses) ? It seems it would be easy to flush the client's write buffer to the NFS server (as it does now) and then perform the fsync() on the server for the file in question. What am I missing ?
Thinking out loud (and without a great deal of thought): to remove the NFS export "async" option, improve small-file write performance and keep data safe, it seems to me one method might be:
1. The NFS server is always in "async" export mode (a client can mount in sync mode if wanted). Data and metadata are (optionally) buffered in RAM on the client and the server.
2. Client fsync() works all the way to disk on the server.
3. Client sync() does an fsync() of each open for write NFS file. (Maybe this will be too much load on NFS servers ...)
4. You implement NFSv4 write delegations :)
5. There is a transaction based system for file writes:
5.1 When a file is opened for write, a transaction is created (id). This is sent with the OPEN call.
5.2 Further file operations including SETATTR, WRITE are allocated as stages in this transaction (id.stage) and are just buffered in the client (no direct server RPC calls).
5.3 The client sends the NFS operations for this write, as and when, optimised into full sized network packets to the server. But the data and metadata are kept buffered in the client.
5.4 The server stores the data in its normal FS RAM buffers during the NFS RPC calls.
5.5 When the server actually writes the data to disk (using its normal optimised disk writing system for the file system and device in question), the transaction and stage (id.stage) are returned to the client (within an NFS reply). The client can now release the buffers up to this stage in the transaction.
The transaction system allows the write delegation to send the data to the server's RAM without the overhead of synchronous writes to the disk.
It does mean the data is stored in RAM in both the client and the server at the same time (twice as much RAM usage). I am not sure how easy it would be to implement in the Linux kernel (NFS informed when FS buffers are freed ?) and it would require NFS protocol extensions for the transactions.
With this method the client can resend the data on a server fail/reboot, and the data can be ensured to be on the disk after an fsync() or sync() (within reason!). It should offer the fastest write performance, should eliminate the untar performance issue with small file creation/writes, and would still be relatively safe with data if the server dies. Unless I am missing something ? (A rough sketch of the client-side bookkeeping this would need follows below.)
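A very rough, purely illustrative sketch of the per-file bookkeeping a client might keep for the transaction scheme described above (nothing here exists in the NFS code or protocol; every name is hypothetical):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct txn_op {                 /* one buffered operation (WRITE, SETATTR, ...) */
    uint32_t stage;             /* position within the transaction */
    void *data;                 /* buffered payload, kept until acknowledged */
    size_t len;
    struct txn_op *next;
};

struct write_txn {
    uint64_t id;                /* transaction id sent with OPEN */
    uint32_t next_stage;        /* next stage number to assign */
    uint32_t acked_stage;       /* highest stage the server reports on disk */
    struct txn_op *pending;     /* ops not yet known to be durable */
};

/* Record an operation; the client may transmit it at any time but must keep
 * the buffer until the server acknowledges the stage as written to disk. */
static uint32_t txn_record(struct write_txn *t, const void *data, size_t len)
{
    struct txn_op *op = malloc(sizeof *op);
    op->stage = t->next_stage++;
    op->data = malloc(len);
    memcpy(op->data, data, len);
    op->len = len;
    op->next = t->pending;
    t->pending = op;
    return op->stage;
}

/* A server reply carries (id, stage) once data up to that stage is on disk:
 * everything at or below it can be released (or replayed after a crash). */
static void txn_ack(struct write_txn *t, uint32_t stage)
{
    struct txn_op **p = &t->pending;
    if (stage > t->acked_stage)
        t->acked_stage = stage;
    while (*p) {
        if ((*p)->stage <= stage) {
            struct txn_op *done = *p;
            *p = done->next;
            free(done->data);
            free(done);
        } else {
            p = &(*p)->next;
        }
    }
}

int main(void)
{
    struct write_txn t = { .id = 17, .next_stage = 1, .acked_stage = 0, .pending = NULL };
    uint32_t s1 = txn_record(&t, "some data", 9);   /* buffered WRITE */
    uint32_t s2 = txn_record(&t, "more data", 9);   /* another buffered op */
    txn_ack(&t, s1);    /* server says everything up to s1 is on disk */
    (void)s2;
    return 0;
}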
PS: I have some RPC latency figures for some other NFS servers at work. The NFS RPC latency on some of them is nearer the ICMP ping times, i.e. about 100us. Maybe quite a bit of CPU is needed to respond to an NFS RPC call these days. The 500us RPC time was on an oldish home server using an Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz.
Terry
On 13 February 2018 at 02:01, Terry Barnaby terry1@beam.ltd.uk wrote:
Yes: http://nfs.sourceforge.net/#faq_a8
--b.
Quite right, it was network limited (disk vs network speed is about the same). Using a slower USB stick disk shows that fsync() is not working with a NFSv4 "async" export.
But why is this ? It just doesn't make sense to me that fsync() should work this way even with an NFS "async" export ? Why shouldn't it do the right thing "synchronize a file's in-core state with storage device" (I don't consider an NFS server a storage device only the non volatile devices it uses). It seems it would be easy to flush the clients write buffer to the NFS server (as it does now) and then perform the fsync() on the server for the file in question. What am I missing ?
You seem to be missing the part where several people have told you that the async option in the server is misnamed. The NFS server "async" export option was named that ~20 years ago (?) to match requirements from sites that wanted NFSv2 'look and feel' with NFSv3, and it has kept that name ever since because changing it would break people's setups.
What you are wanting may be useful as a renamed and different feature, but it really needs to be done on the NFS kernel mailing list versus here. While you have 2 NFS oriented kernel developers here, they are only a subset of the people who would need to look at it and see if it could be done.
On Tue, Feb 13, 2018 at 07:01:22AM +0000, Terry Barnaby wrote:
The transaction system allows the write delegation to send the data to the servers RAM without the overhead of synchronous writes to the disk.
As far as I'm concerned this problem is already solved--did you miss the discussion of WRITE/COMMIT in other email?
The problem you're running into is with metadata (file creates) more than data.
PS: I have some RPC latency figures for some other NFS servers at work. The NFS RPC latency on some of them is nearer the ICMP ping times, i.e. about 100us. Maybe quite a bit of CPU is needed to respond to an NFS RPC call these days. The 500us RPC time was on an oldish home server using an Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz.
Tracing to figure out the source of the latency might still be interesting.
--b.
On 15/02/18 16:48, J. Bruce Fields wrote:
On Tue, Feb 13, 2018 at 07:01:22AM +0000, Terry Barnaby wrote:
The transaction system allows the write delegation to send the data to the servers RAM without the overhead of synchronous writes to the disk.
As far as I'm concerned this problem is already solved--did you miss the discussion of WRITE/COMMIT in other email?
The problem you're running into is with metadata (file creates) more than data.
Not quite, I think, unless I missed something ? With the transaction method on top of write delegations the NFS server is always effectively in "async" mode, so the actual disk writes are asynchronous with respect to the final NFS WRITEs from the write delegation, and thus the disk writes can be optimised by the OS as normal, as and when, without the data being lost if the server dies. So there is no need for the bottleneck of server-side "sync" mode and special disk systems unless a particular requirement calls for them.
As this method protects the data, even the metadata can be passed through asynchronously apart from the original open, assuming the particular requirements are such that other clients don't need live access to this metadata. As far as I can see, with only quick thoughts so I may be missing something major (!), this method could almost fully remove the latency issues with NFS writes of small files while using conventional disk systems in the server.
As someone suggested that this is not the right place for this sort of discussion, I will try to start a discussion on the NFS kernel mailing list when I have some time. It would be good to see improved performance with NFS writes of small files while still being secure, and also to find out why fsync() does not work when the NFS export is in "async" mode!
PS: I have some RPC latency figures for some other NFS servers at work. The NFS RPC latency on some of them is nearer the ICMP ping times, i.e. about 100us. Maybe quite a bit of CPU is needed to respond to an NFS RPC call these days. The 500us RPC time was on an oldish home server using an Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz.
Tracing to figure out the source of the latency might still be interesting.
Will see if I can find some time to do this, away for a bit at the moment.
--b.
On Mon, Feb 12, 2018 at 05:35:49PM +0000, Terry Barnaby wrote:
Well that seems like a major drop off, I always thought that fsync() would work in this case.
No, it never has.
I don't understand why fsync() should not operate as intended ? Sounds like this NFS async thing needs some work !
By "NFS async" I assume you mean the export option. Believe me, I'd remove it entirely if I thought I could get away with it.....
I still do not understand why NFS doesn't operate in the same way as a standard mount on this. The use for async is only for improved performance due to disk write latency and speed (or are there other reasons ?)
Reasons for the async export option? Historically I believe it was a workaround for the fact that NFSv2 didn't have COMMIT, so even writes of ordinary file data suffered from the problem that metadata-modifying operations still have today.
So with a local system mount:
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Data/metadata is flushed to disk by fsync(), sync() and occasionally by the kernel. A process's data is not actually stored until fsync(), sync() etc.
sync (with the sync option): data/metadata is written to disk before the system calls return (all FS system calls ?).
With an NFS mount I would have thought it should be the same.
As a distributed filesystem which aims to survive server reboots, it's more complicated.
async (normal mode): all system calls manipulate the in-memory buffered disk structures (inodes etc.). Normally these would be on the server (so multiple clients can work with the same data), but with some options (for particular usage) client-side write buffering/caching could be used (i.e. data would not actually pass to the server during every FS system call).
Definitely required if you want to, for example, be able to use the full network bandwidth when writing data to a file.
Data/metadata is flushed to the server's disk by fsync(), sync() and occasionally by the kernel (if client-side write caching is used, this flushes across the network and then flushes the server's buffers). A process's data is not actually stored until fsync(), sync() etc.
I'd be nervous about the idea of a lot of unsync'd metadata changes sitting around in server memory. On a server crash/restart that's a bunch of files and directories that are visible to every client, and that vanish without anyone actually deleting them. I wonder what the consequences would be?
This is something that can only happen on a distributed filesystem: on ext4, a crash takes down all the users of the filesystem too....
(Thinking about this: don't we already have a tiny window during the rpc processing, after a change has been made but before it's been committed, when a server crash could make the change vanish? But, no, actually, I believe we hold a lock on the parent directory in every such case, preventing anyone from seeing the change till the commit has finished.)
Also, delegations potentially hide both network and disk latency, whereas your proposal only hides disk latency. The latter is more important in your case. I'm not sure what the ratio is for higher-end setups, actually--probably disk latency is still higher if not as high.
sync (with the client-side sync option): data/metadata is written across NFS and to the server's disk before the system calls return (all FS system calls ?).
I really don't understand why the async option is implemented on the server export, although a sync option there could force sync for all clients of that export. What am I missing ? Is there some good reason (rather than history) that it is done this way ?
So, again, Linux knfsd's "async" export behavior is just incorrect, and I'd be happier if we didn't have to support it.
See above for why I don't think what you describe as async-like behavior would fly.
As for adding protocol to allow the server to tell all clients that they should do "sync" mounts: I don't know, I suppose it's possible, but a) I don't know how much use it would actually get (I suspect "sync" mounts are pretty rare), and b) that's meddling with client implementation behavior a little more than we normally would in the protocol. The difference between "sync" and "async" mounts is purely a matter of client behavior, after all, it's not really visible to the protocol at all.
--b.
On Mon, Feb 05, 2018 at 06:06:29PM -0500, J. Bruce Fields wrote:
Or this?:
https://www.newegg.com/Product/Product.aspx?Item=N82E16820156153&cm_re=s...
Ugh, Anandtech explains that their marketing is misleading, that drive can't actually destage its volatile write cache on power loss:
https://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder
I've been trying to figure this out in part because I wondered what I might use if I replaced my home server this year. After some further looking the cheapest PCIe-attached SSD with real power loss protection that I've found is this Intel model a little over $300:
http://www.intel.com/content/www/us/en/products/memory-storage/solid-state-d...
Kinda ridiculous to buy a 450 gig drive mainly so I can put a half-gig journal on it. It might turn out to be best for my case just to RAID a couple of those SSDs and skip the conventional drives completely.
--b.
On Wed, Jan 31, 2018 at 07:34:24PM -0600, Jeremy Linton wrote:
On 01/31/2018 09:49 AM, J. Bruce Fields wrote:
In the kernel compile case there's probably also a lot of re-opening and re-reading files too? NFSv4 is chattier there too. Read delegations should help compensate, but we need to improve the heuristics that decide when they're given out.
The main kernel include files get repeatedly hammered, despite them in theory being in cache, IIRC. So yes, if the concurrent (re)open path is even slightly slower it's going to hurt a lot.
All that aside I can't think what would explain that big a difference (45 minutes vs. 5). It might be interesting to figure out what happened.
I had already spent more than my allotted time looking in the wrong direction at the filesystem/RAID (I did turn off intellipark though) by the time I discovered the NFSv3/v4 perf delta. It's been sitting way down on the "things to look into" list for a long time now. I'm still using it as an NFS server, so at some point I can take another look if the problem persists.
OK, understood.
Well, if you ever want to take another look at the v4 issue--I've been meaning to rework the delegation heuristics. Assuming you're on a recent kernel, I could give you some experimental (but probably not too risky) kernel patches if you didn't mind keeping notes on the results.
I'll probably get around to it eventually on my own, but it'd probably happen sooner with a collaborator.
But the difference you saw was so drastic, there may have just been some unrelated NFSv4 bug.....
--b.
On Tue, Jan 30, 2018 at 07:03:17PM +0000, Terry Barnaby wrote:
It looks like each RPC call takes about 0.5ms. Why do there need to be so many RPC calls for this ? The OPEN call could set the attribs, no need for the later GETATTR or SETATTR calls.
The first SETATTR (which sets ctime and mtime to server's time) seems unnecessary, maybe there's a client bug.
The second looks like tar's fault, strace shows it doing a utimensat() on each file. I don't know why or if that's optional.
Even the CLOSE could be integrated with the WRITE and taking this further OPEN could do OPEN, SETATTR, and some WRITE all in one.
We'd probably need some new protocol to make it safe to return from the open system call before we've gotten the OPEN reply from the server.
Write delegations might save us from having to wait for the other operations.
Taking a look at my own setup, I see the same calls taking about 1ms. The drives can't do that, so I've got a problem somewhere too....
--b.
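For reference, the per-file call strace shows tar making is utimensat(), which is what turns into the second SETATTR on the wire; a minimal stand-alone sketch of that mtime-restore step (the path and timestamp are made up) is below. GNU tar's -m/--touch option skips restoring mtimes, which should avoid it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct timespec times[2];

    times[0].tv_sec  = 0;
    times[0].tv_nsec = UTIME_OMIT;      /* leave atime alone */
    times[1].tv_sec  = 1517184000;      /* mtime taken from the archive header */
    times[1].tv_nsec = 0;

    if (utimensat(AT_FDCWD, "/data/src/somefile.c", times, 0) != 0)
        perror("utimensat");
    return 0;
}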
On Tue, Jan 30, 2018 at 04:31:58PM -0500, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 07:03:17PM +0000, Terry Barnaby wrote:
It looks like each RPC call takes about 0.5ms. Why do there need to be so many RPC calls for this ? The OPEN call could set the attribs, no need for the later GETATTR or SETATTR calls.
The first SETATTR (which sets ctime and mtime to server's time) seems unnecessary, maybe there's a client bug.
The second looks like tar's fault, strace shows it doing a utimensat() on each file. I don't know why or if that's optional.
Even the CLOSE could be integrated with the WRITE and taking this further OPEN could do OPEN, SETATTR, and some WRITE all in one.
We'd probably need some new protocol to make it safe to return from the open system call before we've gotten the OPEN reply from the server.
Write delegations might save us from having to wait for the other operations.
Taking a look at my own setup, I see the same calls taking about 1ms. The drives can't do that, so I've got a problem somewhere too....
Whoops, I totally forgot it was still set up with an external journal on SSD:
# tune2fs -l /dev/mapper/export-export | grep '^Journal'
Journal UUID:             dc356049-6e2f-4e74-b185-5357bee73a32
Journal device:           0x0803
Journal backup:           inode blocks
# blkid --uuid dc356049-6e2f-4e74-b185-5357bee73a32
/dev/sda3
# cat /sys/block/sda/device/model
INTEL SSDSA2M080
So, most of the data is striped across a couple big hard drives, but the journal is actually on a small partition on an SSD.
If I remember correctly, I initially tried this with an older intel SSD and didn't get a performance improvement. Then I replaced it with this model which has the "Enhanced Power Loss Data Protection" feature, which I believe means the write cache is durable, so it should be able to safely acknowledge writes as soon as they reach the SSD's cache.
And weirdly I think I never actually got around to rerunning these tests after I installed the new SSD.
Anyway, so that might explain the difference we're seeing.
I'm not sure how to find new SSDs with that feature, but it may be worth considering as a cheap way to accelerate this kind of workload. It can be a very small SSD as it only needs to hold the journal. Adding an external journal is a quick operation (you don't have to recreate the filesystem or anything).
--b.
On 30/01/18 21:31, J. Bruce Fields wrote:
On Tue, Jan 30, 2018 at 07:03:17PM +0000, Terry Barnaby wrote:
It looks like each RPC call takes about 0.5ms. Why do there need to be so many RPC calls for this ? The OPEN call could set the attribs, no need for the later GETATTR or SETATTR calls.
The first SETATTR (which sets ctime and mtime to server's time) seems unnecessary, maybe there's a client bug.
The second looks like tar's fault, strace shows it doing a utimensat() on each file. I don't know why or if that's optional.
Even the CLOSE could be integrated with the WRITE and taking this further OPEN could do OPEN, SETATTR, and some WRITE all in one.
We'd probably need some new protocol to make it safe to return from the open system call before we've gotten the OPEN reply from the server.
Write delegations might save us from having to wait for the other operations.
Taking a look at my own setup, I see the same calls taking about 1ms. The drives can't do that, so I've got a problem somewhere too....
--b.
Also, on the 0.5ms: is this effectively the 1ms system tick, i.e. is the NFS processing driven not by the packet events (not pre-emptive) but by the next system tick ?
An ICMP ping is about 0.13ms (to and fro) between these systems. Although 0.5ms is relatively fast, I wouldn't have thought it should have to take 0.5ms for a minimal RPC even over TCP/IP.
On Tue, Jan 30, 2018 at 10:30:04PM +0000, Terry Barnaby wrote:
Also, on the 0.5ms: is this effectively the 1ms system tick, i.e. is the NFS processing driven not by the packet events (not pre-emptive) but by the next system tick ?
An ICMP ping is about 0.13ms (to and fro) between these systems. Although 0.5ms is relatively fast, I wouldn't have thought it should have to take 0.5ms for a minimal RPC even over TCP/IP.
It'd be interesting to break down that latency. I'm not sure where it's coming from. I doubt it has to do with the system tick.
--b.
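One rough way to break that latency down from the client side (a sketch, not from the thread; the directory is made up and should point at the NFS mount being tested) is to time operations that must go to the server, for example create/remove pairs, and compare the result with the ping time:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        /* O_EXCL create plus unlink: each pair forces round trips to the server */
        int fd = open("/data/latency.tmp", O_WRONLY | O_CREAT | O_EXCL, 0666);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        close(fd);
        unlink("/data/latency.tmp");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f us per create+remove pair\n", secs * 1e6 / ITERS);
    return 0;
}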