I'm getting significant iowait while writing to a 100GB file. I have already made it nocow by copying it to another directory, marking the directory nocow (+C), and using cat <oldfile> > <newfile> to re-create it from scratch.
I was under the impression that this should fix the problem.
On a tangent, it took about 30 minutes to delete the old file... My system is a Ryzen 5 3600 w/ 16GB of memory, but it is a spinning disk. I use an NVMe for the system and the spinning disk for /home.
Currently I'm getting random GUI freezes due to the iowait problem and my HDD indicator light basically stays on solid for over an hour now.
Any tips?
Thanks, Richard
This won't speed up the actual IO but it should reduce the impact on other work.
If you aren't familiar with how to apply the settings below, man sysctl explains it.

Set these two:
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000

They will be 0 to start with. The two ratio settings are what was in use before you set the bytes values, and they read as 0 once the bytes values are set:
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
The ratio is a percentage of memory. So 16GB * (dirty_ratio - dirty_background_ratio) / write_rate (guess, say, 50MB/s; could be 2x either way) is how long, once you hit dirty_ratio, it takes for the IO to unfreeze by draining back down to dirty_background_ratio. It takes 2-3 seconds to clear 1% of 16GB, so bigger numbers are much worse. (With the default 20/10 split that is 1.6GB of dirty data to drain: over 30 seconds at 50MB/s.)
I set mine so that I really only have 2MB to clear, and that will clear before I notice. Overall, about all a big write cache does for you is give you the false sense, with smaller writes, that they are done when really they are not.
grep -i dirty /proc/meminfo will show you how much you have outstanding, and it will bounce between the 2 settings.
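If you want these to persist across reboots, the usual place is a drop-in file under /etc/sysctl.d (the filename below is arbitrary):

# /etc/sysctl.d/90-dirty.conf
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000

Then sudo sysctl -p /etc/sysctl.d/90-dirty.conf applies it immediately; otherwise it takes effect at the next boot.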
With Fedora being intended as a desktop platform, why are these settings not the default?
The highest priority for a desktop system is to keep the user experience flowing smoothly, not to maximize disk i/o rates.
Can this be fixed in time for F34? Do we need a bug report?
On 3/24/21 11:27 AM, John Mellor wrote:
With Fedora being intended as a desktop platform, why are these settings not the default?
Because they are ugly workarounds for something that is broken elsewhere.
Seriously, telling the kernel that it should stop applications attempting to write to files as soon as 0% of the RAM contains dirty buffers can't be considered a sane setting.
Regards.
Well, while it is not a great idea, it is better than what is going to happen if you don't prevent them from writing, or if you let the write buffer get so large that going from the high to the low watermark takes too long.

If you never stop the writes, then eventually the kernel will OOM.

And really, about all the default setting does is make the benchmarks look good initially. That is, until the benchmark adds code to correctly keep the clock running until the write buffer is synced/clear.

It is also not sane to let writes (that in a lot of cases you aren't going to read again soon) force a machine to page, because the high IO on a slow device lets the write cache push out pages that one is going to reuse much sooner.

Maybe the system should have some code in the path to only allow so many seconds of outstanding writes per device. But given all of the possible IO paths to devices, I don't think that is something anyone could reasonably be expected to code.
On Wed, Mar 24, 2021 at 4:29 AM John Mellor john.mellor@gmail.com wrote:
With Fedora being intended as a desktop platform, why are these settings not the default?
The highest priority for a desktop system is to keep the user experience flowing smoothly, not to maximize disk i/o rates.
Can this be fixed in time for F34? Do we need a bug report?
Post to devel@ or desktop@ lists, is my advice. If you convince Workstation edition folks, it'll certainly get more attention on devel@ once it's a feature/change request.
It might be appropriate to set dirty_bytes to 500M across the board, desktop and server. And dirty_background to 1/4 that. But all of these are kinda rudimentary guides. What we really want is something that knows what the throughput of the storage is, and is making sure there isn't more than a few seconds of writeback needed at any given time.
The default, dirty_ratio 20%, is high by today's memory standards. But upstream will not change it. All kernel knobs are distro responsibility to change from the defaults.
I'm the first person to pooh-pooh benchmarks. But if you run all of them, it should be pretty easy to show whether it's generally useful, or generally not useful (i.e. either bad or ambiguous) to change it.
On 3/25/21 4:25 AM, Chris Murphy wrote:
It might be appropriate to set dirty_bytes to 500M across the board, desktop and server. And dirty_background to 1/4 that. But all of these are kinda rudimentary guides. What we really want is something that knows what the throughput of the storage is, and is making sure there isn't more than a few seconds of writeback needed at any given time.
The default, dirty_ratio 20%, is high by today's memory standards. But upstream will not change it. All kernel knobs are distro responsibility to change from the defaults.
I don't agree with the base reasoning. There is nothing wrong in having many gigabytes of dirty data in memory, if the machine has enough RAM to do it. It is one of the things that make the difference between Linux and toy systems. "500M will be enough" sounds like the historical "640k will be enough", because 500M can be flushed in a fraction of a second on modern SSDs.
What you really want is that if there are 40GB of outstanding data going to the disk, processes are still:
1) able to write to the disks without heavy latency (and delaying it in memory is exactly achieving that);
2) able to read the disks without heavy latency, which is something the disk scheduling code will take care to provide (reads have priority over writes).
The kernel has even got per-device queues to avoid a slow USB drive stalling the I/O for the other devices.
If the filesystem is not able to read 50kB because there are 10GB of dirty data in memory, the problem is in the filesystem code.
Regards.
On Thu, Mar 25, 2021 at 8:59 AM Roberto Ragusa mail@robertoragusa.it wrote:
I don't agree with the base reasoning. There is nothing wrong in having many gigabytes of dirty data in memory, if the machine has enough RAM to do it. It is one of the things that make the difference between Linux and toy systems.
The problem has been well understood for some time. https://lwn.net/Articles/572911/
"500M will be enough" sound like the historical "640k will be enough", because 500M could be flushed in a fraction of a second on modern SSDs.
Which would be an OK combination. What you don't want is 1G of dirty data accumulating before your 80MB/s drive starts writeback. If the process fsyncs a file, now you've got a blocked task that must write out all dirty data right now, and the whole storage stack down to the drive is not going to easily stop doing that just because some other program fsyncs a 100K cache file. There will be a delay.

Now, if this delay causes that program to stall from the user's perspective, is that well behaved? I mean come on, we all know web browser cache files are 100% throwaway garbage; they're there to make things faster, not to cause problems, and yet here we are.
There's enough misuse of fsync by applications that there's a utility to cause fsync to be dropped. https://www.flamingspork.com/projects/libeatmydata/
In fact some folks run their web browser in a qemu-kvm with cache mode "unsafe" to drop all the fsyncs.
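For what it's worth, eatmydata is just an LD_PRELOAD shim, so using it is a matter of prefixing the command (the program name here is only an example):

$ eatmydata some-program --args

Every fsync/fdatasync/sync the process (and its children) makes becomes a no-op, so it's only for data you can afford to lose.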
What you really want is that if there are 40GB of outstanding data going to the disk, processes are still:
1) able to write to the disks without heavy latency (and delaying it in memory is exactly achieving that);
2) able to read the disks without heavy latency, which is something the disk scheduling code will take care to provide (reads have priority over writes).
If you have 40G of dirty data and your program says "fsync it" you've got 40G of data that has been ordered flushed to stable media. Everything else wanting access is going to come close to stopping. That's the way it works. You don't get to "fsync this very important thing..but oh yeah wait I wanna read a b.s. chrome cache file hold on a sec. Ok thanks, now please continue syncing."
That's in effect something that multiqueue NVMe can do. So there's a work around.
The kernel has even got per-device queues to avoid a slow USB drive stalling the I/O for the other devices.
If the filesystem is not able to read 50kB because there are 10GB of dirty data in memory, the problem is in the filesystem code.
The defaults are crazy. https://lwn.net/Articles/572921/
Does this really make a difference though outside the slow USB stick example? I don't know. Seems like it won't for fsync heavy handedness because that'll take precedence.
On Thu, Mar 25, 2021 at 7:26 PM Chris Murphy lists@colorremedies.com wrote:
The defaults are crazy. https://lwn.net/Articles/572921/
Does this really make a difference though outside the slow USB stick example? I don't know. Seems like it won't for fsync heavy handedness because that'll take precedence.
There's more about this in that same 8 year old thread. https://lore.kernel.org/linux-mm/20131111032211.GT6188@dastard/
I wonder what the state of the implied work is now, and whether the applications should be making better use of fadvise to inform the kernel of its intentions/expectations with respect to the files it's writing.
On 3/26/21 2:26 AM, Chris Murphy wrote:
If you have 40G of dirty data and your program says "fsync it" you've got 40G of data that has been ordered flushed to stable media. Everything else wanting access is going to come close to stopping. That's the way it works. You don't get to "fsync this very important thing..but oh yeah wait I wanna read a b.s. chrome cache file hold on a sec. Ok thanks, now please continue syncing."
That's in effect something that multiqueue NVMe can do. So there's a work around.
Well, there is no reason for fsync to block everything else. The meaning of fsync is that the process is telling "I will not proceed until you tell me this file has reached the disk", and that is a hint to the kernel to begin writing with the objective to let the process get unstuck. Indeed fsync doesn't mean "hey, I am in emergency mode, stop everything else because my stuff is important". So a good filesystem on a good kernel will correctly apply priorities, fairness etc. to let other processes do their I/O. You are not irreversibly queueing 40G to the drive, the drive is going to get small operations (e.g. 1000 blocks) and there is a chance for the kernel to insert other I/O in the flow.

But there are two issues to consider:
1) there could be huge "irreversible" queues somewhere; this problem is similar to bufferbloat for network packets, but I do not think I/O suffers too much, considering there are no intermediate nodes in the middle;
2) there must not be shortcomings in the filesystem code; for example, ext3 ordered mode was flushing everything when asked to flush a 1kB file; I don't know about ext4, I don't know about btrfs.
In summary:
- if a process calls fsync and then complains about having to wait to get unblocked, it is just creating its own problem (are you doing fsync of big things in your UI thread?)
- if a process gets heavily delayed because another process is doing fsync, the kernel is not doing its job in terms of fairness
I can agree that reality may not be ideal, but I don't like this attitude of "dropping the ball" by disabling caching here and there because (provocative paradox) web browser authors are using fsync for bookmarks and cookies DB in the UI thread.
NOTE: I know about eatmydata, I've used it sometimes. There is also this nice trick for programs stupidly doing too many fsyncs: systemd-nspawn --system-call-filter='~sync:0 fsync:0'
Regards.
On Fri, Mar 26, 2021 at 4:00 PM Roberto Ragusa mail@robertoragusa.it wrote:
Well, there is no reason for fsync to block everything else.
In practice, it does. There's only one thing happening at a time with an HDD, so while the write is flushing, nothing else is going to get either a read or a write in, which is why it's not a great idea to build up a lot of dirty data and hit it with fsync, and do that all day long, unless you're a server whose task it is to do that particular workload. Mixed workloads are much harder, and that's what we have on the desktop.
The meaning of fsync is that the process is telling "I will not proceed until you tell me this file has reached the disk", and that is a hint to the kernel to begin writing with the objective to let the process get unstuck.
And what if you have two callers of fsync? If the first one has 10 seconds of writeback to do on fsync, what happens to the fsync of another caller? It's going to have to wait 10 seconds *plus* the time for its own writeback.
This is why you want maybe a second or two of writeback, and programs that aren't recklessly hammering their files with fsync just because they think that's the only way they're ever going to get on disk.
Indeed fsync doesn't mean "hey, I am in emergency mode, stop everything else because my stuff is important".
If you have multiple aggressive writers calling any kind of sync, you have the potential for contention. When concurrent writes and sync's happen through a single point in time writer, what should happen? Btrfs can actually do a better job of this because it'll tend to aggregate those random writes into sequential writes, interleaving them. The problem with that interleaving comes at read time, because now there's fragmentation. The fragmentation problem is sometimes overstated because contiguity isn't as important as proximity, you can have nearby blocks resulting in low read latency even if they aren't contiguous and quite a lot of engineering has gone into making drives and drive controllers do that work. But it can't overcome significantly different placement of a file's blocks. If all of them need to be read and they're far apart due to prior interleaved writes, you'll see seek latency go up.
There's no free lunch, there's tradeoffs for everything. There's a reason for 15k rpm hard drives after all.
So a good filesystem on a good kernel will correctly apply priorities, fairness etc. to let other processes do their I/O.
They can do their own buffered writes while fsync is happening. Concurrent fsyncs means competition. If the data being fsync'd is small, then it's not likely to get noticed until you have 3 or more contenders (for spinning media) pushing latency to around 150ms, enough to be noticed by a person. Noticed, not necessarily annoyed. But if you have even one process producing a lot of anonymous pages and fsyncing frequently, you're going to see tens of seconds of contention for a device that cannot do simultaneous writes and will be very reluctant to
You are not irreversibly queueing 40G to the drive, the drive is going to get small operations (e.g. 1000 blocks) and there is a chance for the kernel to insert other I/O in the flow.
Sure and what takes 5 seconds as a dedicated fsync now becomes 20 seconds when you add a bunch of reads to it, and both the reads and writes will be noticeably slower than they were when they weren't in contention.
This is why multiqueue low latency drives are vastly better for mixed workloads.
But there are two issues to consider:
1) there could be huge "irreversible" queues somewhere; this problem is similar to bufferbloat for network packets, but I do not think I/O suffers too much, considering there are no intermediate nodes in the middle;
2) there must not be shortcomings in the filesystem code; for example, ext3 ordered mode was flushing everything when asked to flush a 1kB file; I don't know about ext4, I don't know about btrfs.
Btrfs flushes just the files in the directory being fsync'd, but there can be contention on the tree log which is used to make fsync's performant. There is a tree log per subvolume. I doubt that separating the two workloads into separate subvolumes will help in this case, it doesn't sound like either workload is really that significant but I don't know what's going on so it could be worth a try. But I'd say if you're fiddling with things on this level that it's important to be really rigorous and only apply one change at a time, otherwise it's impossible to know what made things better or worse.
Note that while subvolumes are mostly like directories, they are separate namespaces with separate file descriptors; stat will show them as different devices, and they have their own pool of inodes. You can't create hardlinks across subvolumes (you can create reflinks across subvolumes, but there is a VFS limitation that prevents reflinks across mount points).
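If someone wants to try that separation experiment, a sketch (paths are only examples; stop the writers first):

$ btrfs subvolume create /home/richard/dbsub
$ mv /home/richard/.bitmonero/lmdb /home/richard/dbsub/

Keep in mind mv can't rename() across subvolumes; it falls back to copy and delete, which for a 100G file will take a while.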
In summary:
- if a process calls fsync and then complains about having to wait to get unblocked, it is just creating its own problem (are you doing fsync of big things in your UI thread?)
- if a process gets heavily delayed because another process is doing fsync, the kernel is not doing its job in terms of fairness
It very much depends on the workload. And there's also cgroup io.latency to consider. The desktop is in a better position to make decisions on what's more important: UI/UX responsiveness is often more important than performance.
I can agree that reality may not be ideal, but I don't like this attitude of "dropping the ball" by disabling caching here and there because (provocative paradox) web browser authors are using fsync for bookmarks and cookies DB in the UI thread.
No one has suggested disabling caching. Reducing the *time* to start write back is what was suggested and we don't even know if that matters anymore, and I wasn't even the one who first suggested it, goes back to Linus saying the defaults are crazy 10 years ago.
NOTE: I know about eatmydata, I've used it sometimes. There is also this nice trick for programs stupidly doing too many fsyncs: systemd-nspawn --system-call-filter='~sync:0 fsync:0'
That is awesome! Way easier to deal with than eatmydata.
A little thread necro... I stopped the blockchain daemon for a while and recently restarted it and am now seeing GUI freezes while it resyncs even though the file itself is marked +C...
Here's the output of the requested commands:
Thanks, Richard
I don't know why, but the spinning disk is being crushed.

If you divide the MB/sec by the reads you get around 4k per read (that is about as bad as you can do). If you multiply reads/sec * r_await you get all of the time accounted for.

And since each read is taking around 8-10ms (around the disk's seek time to a new track), each block being read is not in the disk's cache, hence probably not on the same track the disk just read, or recently read and still holds in its cache. If the file you are rsyncing was written slowly, or quickly but with a number of other IOs happening between each IO, that increases the chances of the file being massively fragmented and acting like this.

Is this a single file, and is this single rsync the only thing running?

And what disk are you syncing to/from? And how was the file that you are rsyncing created? I have seen a DB do this (in a sequential backup); that file was created in such a way that for the most part no 2 blocks were next to each other on disk. I believe the average blocksize for that file was 5.1k. The filefrag command will show file fragments on some filesystems. There may be different commands needed depending on what filesystem the file comes from.
On Fri, Apr 30, 2021 at 6:16 PM Roger Heflin rogerheflin@gmail.com wrote:
I don't know why but the spinning disk is being crushed.
Well, a little googling after my post and it appears the database is LMDB, which is a COW db. So I can see how a COW DB on top of a COW FS may be a problem, but I have marked the directory nodatacow...
$ lsattr ~/.bitmonero/lmdb
-------------------- /home/richard/.bitmonero/lmdb/lock.mdb
---------------C---- /home/richard/.bitmonero/lmdb/data.mdb
If you divide the MB/sec by the reads you get around 4k per read (that is about as bad as you can do). If you multiply reads/sec * r_await you get all of the time accounted for.

And since each read is taking around 8-10ms (around the disk's seek time to a new track), each block being read is not in the disk's cache, hence probably not on the same track the disk just read, or recently read and still holds in its cache. If the file you are rsyncing was written slowly, or quickly but with a number of other IOs happening between each IO, that increases the chances of the file being massively fragmented and acting like this.
Is this a single file, and is this single rsync the only thing running?
I don't know the inner workings of the client, but it is a single file as seen above.
And what disk are you syncing to/from? And how was the file that you are rsyncing created? I have seen a DB do this (in a sequential backup); that file was created in such a way that for the most part no 2 blocks were next to each other on disk. I believe the average blocksize for that file was 5.1k. The filefrag command will show file fragments on some filesystems. There may be different commands needed depending on what filesystem the file comes from.
$ filefrag ~/.bitmonero/lmdb/data.mdb
/home/richard/.bitmonero/lmdb/data.mdb: 388217 extents found
Thanks, Richard
388217 * 10ms = about 3800 seconds to read that file or about 26MB/sec, but with all of the seeks most of that time will be idle time waiting on disk (iowait), and it is very possible that parts of the file have large extents and other parts of the file are horribly fragmented. And that ignores any time to do any other work related to the rsync and file io. How long does it take to copy the file?
btrfs has an autodefrag mount option, no idea how well it works, but given enough time it might be able to reduce the extents to a reasonable number and keep them under control.
So long as you are using rsync to read the file, the fact that the db is COW is probably not an issue (since from rsync's point of view it is just one big file). If you have small writes to the file and btrfs was set to COW, that would make a mess. Not sure btrfs is a good filesystem choice for a db on a spinning disk; disabling COW might have mostly fixed this. It's not clear to me whether setting the defrag option now will fix the already fragmented parts of the file or not. And if you turned COW off later, the file may already have been heavily fragmented.
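If you want to try compacting just that one file instead of mounting with autodefrag, something along these lines should do it (the -t target extent size here is just a plausible value):

$ sudo btrfs filesystem defragment -v -t 32M /home/richard/.bitmonero/lmdb/data.mdb

Be aware that defragmenting rewrites the data, so it unshares any snapshot/reflink copies of those extents.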
On Fri, Apr 30, 2021 at 7:56 PM Roger Heflin rogerheflin@gmail.com wrote:
388217 * 10ms = about 3800 seconds to read that file or about 26MB/sec, but with all of the seeks most of that time will be idle time waiting on disk (iowait), and it is very possible that parts of the file have large extents and other parts of the file are horribly fragmented. And that ignores any time to do any other work related to the rsync and file io. How long does it take to copy the file?
Not sure, I don't make a habit of moving 100GB files around too often :)
btrfs has an autodefrag mount option, no idea how well it works, but given enough time it might be able to reduce the extents to a reasonable number and keep them under control.
Doing a forced defragment now specifically on the file and it's taking a while.
So long as you are using rsync to read the file, the fact that the db is COW is probably not an issue (since from rsync's point of view it is just one big file). If you have small writes to the file and btrfs was set to COW, that would make a mess. Not sure btrfs is a good filesystem choice for a db on a spinning disk; disabling COW might have mostly fixed this. It's not clear to me whether setting the defrag option now will fix the already fragmented parts of the file or not. And if you turned COW off later, the file may already have been heavily fragmented.
I'm not specifically accessing the file; the Monero daemon is, so no rsync is involved as far as I know. I have already set the directory containing the file as nodatacow and moved the file to ensure it was set previously.
Thanks, Richard
On Thu, Mar 25, 2021 at 7:26 PM Chris Murphy lists@colorremedies.com wrote:
The problem has been well understood for some time. https://lwn.net/Articles/572911/
This is an update on that 8 year old story. Last year writebehind patches were proposed, and a discussion ensued.
https://lore.kernel.org/linux-mm/CAHk-=whf2BQ8xqVBF8YuxRznByrP-oTgcHSY9DgDnr...
I don't think an fsync/fdatasync centric workload is going to be affected by these knobs. And it'd take testing to find out if they affect anything we care about, but it's interesting that changing them could improve real world performance while negatively impacting synthetic benchmarking.
On Tue, Mar 23, 2021 at 8:39 AM Richard Shaw hobbes1069@gmail.com wrote:
I'm getting significant iowait while writing to a 100GB file.
High iowait means the system is under load and not CPU bound but IO bound. It sounds like the drive is writing as fast as it can. What's the workload? Reproduce the GUI stalls and capture all of the following:
sudo iostat -x -d -m 5
This is part of sysstat package (you can disable the service and timer units it installs). Probably best to copy/paste into a plain text file and put it up in a file share service, most anything else is going to wrap it, making it hard to read. A minute of capture while the workload is proceeding is enough. Also capture a few
grep -R . /proc/pressure
And each of these (workload doesn't need to be running)
lsblk -o NAME,FSTYPE,SIZE,FSUSE%,MOUNTPOINT,UUID,MIN-IO,SCHED,DISC-GRAN,MODEL
uname -r
mount | grep btrfs
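For reference, /proc/pressure/io looks like this (the numbers here are made up, not from any real system):

some avg10=2.04 avg60=0.75 avg300=0.40 total=157622151
full avg10=1.12 avg60=0.30 avg300=0.12 total=101676802

The "full" line is the interesting one: it's the share of time all non-idle tasks were stalled on IO, so sustained double-digit avg10 during a GUI freeze would corroborate the iowait theory.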
I have already made it nocow by copying it to another directory, marking the directory nocow (+C), and using cat <oldfile> > <newfile> to re-create it from scratch.
I was under the impression that this should fix the problem.
It depends on the workload for this file. Was the 100G file fallocated or created as a sparse file? File format?
On a tangent, it took about 30 minutes to delete the old file... My system is a Ryzen 5 3600 w/ 16GB of memory, but it is a spinning disk. I use an NVMe for the system and the spinning disk for /home.
filefrag 100G.file

What's the path to the file?
Currently I'm getting random GUI freezes due to the iowait problem and my HDD indicator light basically stays on solid for over an hour now.
Have sysrq+t ready in a shell but don't issue it. Reproduce this problem (the GUI freezes) and then issue the sysrq+t. Depending on how many processes, this could exceed both the kernel message buffer and the journald rate limiter. Either use the log-buf-len=8M boot parameter, in which case dmesg will have the whole sysrq+t, or temporarily turn off journald rate limiting in journald.conf
#RateLimitIntervalSec=30s #RateLimitBurst=10000
Adding a 0 should work. Restart journald. Issue sysrq+t. Output to a file with 'journalctl -k -o short-monotonic --no-hostname > journal.log'
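If the key combo is awkward, root can trigger the same dump by writing to /proc/sysrq-trigger instead:

$ echo t | sudo tee /proc/sysrq-trigger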
I suggest opening a bug against the kernel, and post the URL here so I can tag it. Attach the iostat output and dmesg/journalctl output as files to that bug, and everything else can just go in the description.
Also note any other customizations to /proc or /sys that differ from Fedora defaults.
On Tue, Mar 23, 2021 at 7:11 PM Chris Murphy lists@colorremedies.com wrote:
On Tue, Mar 23, 2021 at 8:39 AM Richard Shaw hobbes1069@gmail.com wrote:
I'm getting significant iowait while writing to a 100GB file.
High iowait means the system is under load and not CPU bound but IO bound. It sounds like the drive is writing as fast as it can. What's the workload?
I was syncing a 100GB blockchain, which means it was frequently getting appended to, so COW was really killing my I/O (iowait > 50%), but I had hoped that marking it nodatacow would be a 100% fix; however, iowait would be quite low but jump up on a regular basis to 25%-50%, occasionally locking up the GUI briefly. It was worst when the blockchain was syncing and I was rm'ing the old COW version, even after rm returned. I assume there were quite a few background tasks still updating.
Reproduce the GUI stalls and capture all of the following:

sudo iostat -x -d -m 5

This is part of sysstat package (you can disable the service and timer units it installs). Probably best to copy/paste into a plain text file and put it up in a file share service, most anything else is going to wrap it, making it hard to read. A minute of capture while the workload is proceeding is enough. Also capture a few
grep -R . /proc/pressure
Unfortunately it's now fully synced, so I can't easily reproduce the workload. I could move the file to another directory and start over, but then the file would start at 0 bytes and it takes two to three days for things to sync.
And each of these (workload doesn't need to be running)
lsblk -o NAME,FSTYPE,SIZE,FSUSE%,MOUNTPOINT,UUID,MIN-IO,SCHED,DISC-GRAN,MODEL
uname -r
mount | grep btrfs
$ lsblk -o NAME,FSTYPE,SIZE,FSUSE%,MOUNTPOINT,UUID,MIN-IO,SCHED,DISC-GRAN,MODEL
NAME        FSTYPE   SIZE FSUSE% MOUNTPOINT UUID                                 MIN-IO SCHED DISC-GRAN MODEL
sda                  2.7T                                                          4096 bfq          0B ST3000DM008-2DM166
└─sda1      btrfs    2.7T    27% /home      e80829f3-3dd3-486d-a553-dcf54b384c80   4096 bfq          0B
sr0                 1024M                                                           512 bfq          0B HL-DT-ST_BD-RE_WH14NS40
zram0                  4G        [SWAP]                                            4096              4K
nvme0n1            465.8G                                                           512 none       512B Samsung SSD 970 EVO Plus 500GB
├─nvme0n1p1 vfat     600M     3% /boot/efi  98D5-E8CE                               512 none       512B
├─nvme0n1p2 ext4       1G    28% /boot      48295095-3e89-4d32-905f-bbffcd2051ff    512 none       512B
└─nvme0n1p3 btrfs  464.2G     4% /var       eca99700-77f4-44ea-b8d5-26673abc4d65    512 none       512B
I have already made it nocow by copying it to another directory, marking the directory nocow (+C), and using cat <oldfile> > <newfile> to re-create it from scratch.

I was under the impression that this should fix the problem.
It depends on the workload for this file. Was the 100G file fallocated or created as a sparse file? File format?
I assume that for a blockchain it starts small and is just grown / appended to.
On a tangent, it took about 30 minutes to delete the old file... My system is a Ryzen 5 3600 w/ 16GB of memory, but it is a spinning disk. I use an NVMe for the system and the spinning disk for /home.

filefrag 100G.file

What's the path to the file?
$ filefrag /home/richard/.bitmonero/lmdb/data.mdb
/home/richard/.bitmonero/lmdb/data.mdb: 1424 extents found
However, I let a rebalance run overnight.
Thanks, Richard
On Wed, Mar 24, 2021 at 6:09 AM Richard Shaw hobbes1069@gmail.com wrote:
I was syncing a 100GB blockchain, which means it was frequently getting appended to, so COW was really killing my I/O (iowait > 50%), but I had hoped that marking it nodatacow would be a 100% fix; however, iowait would be quite low but jump up on a regular basis to 25%-50%, occasionally locking up the GUI briefly. It was worst when the blockchain was syncing and I was rm'ing the old COW version, even after rm returned. I assume there were quite a few background tasks still updating.

I assume that for a blockchain it starts small and is just grown / appended to.
Append writes are the same on overwriting and cow file systems. You might get slightly higher iowait because datacow means datasum which means more metadata to write. But that's it. There's no data to COW if it's just appending to a file. And metadata writes are always COW.
You could install bcc-tools and run btrfsslower with the same (exclusive) workload with datacow and nodatacow to see if latency is meaningfully higher with datacow but I don't expect that this is a factor.
iowait just means the CPU is idle waiting for IO to complete. It could do other things, even IO, if that IO can be preempted by proper scheduling. So the GUI freezes are probably because there's some other file on /home, along with this 100G file, that needs to be accessed and between the kernel scheduler, the file system, the IO scheduler, and the drive, it's just reluctant to go do that IO. Again, bcc-tools can help here in the form of fileslower, which will show latency spikes regardless of the file system (it's at the VFS layer and thus closer to the application layer which is where the GUI stalls will happen).
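On Fedora the bcc tools land under /usr/share/bcc/tools, so (assuming the stock package layout) invocation looks like this, where the argument is the minimum latency in ms to report:

$ sudo /usr/share/bcc/tools/fileslower 100
$ sudo /usr/share/bcc/tools/btrfsslower 100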
If this workload can be described in sufficient detail that anyone can reproduce the setup, that makes it possible for multiple other people to collect the information we'd need to track down what's going on. And that includes A/B testing, such as the exact same setup but merely running the 100G sync (presumably it's not the exact size that matters, but the workload while the sync is happening).
Also the more we can take this from the specific case to the general case, including using generic tools like xfs_io instead of a blockchain program, the more attention we can give it because people don't have to learn app specific things. And we can apply the fix to all similar workloads.
On a tangent, it took about 30 minutes to delete the old file... My system is a Ryzen 5 3600 w/ 16GB of memory, but it is a spinning disk. I use an NVMe for the system and the spinning disk for /home.

filefrag 100G.file

What's the path to the file?
$ filefrag /home/richard/.bitmonero/lmdb/data.mdb
/home/richard/.bitmonero/lmdb/data.mdb: 1424 extents found
Just today I deleted a 100G Windows 10 raw file with over 6000 extents and it deleted in 3 seconds. So I'm not sure why the delay in your case; more information is needed. I'm not sure what to use here, maybe btrfsslower while also stracing the rm. There is only one syscall involved, unlinkat(), and it does need to exit before rm will return to a prompt. But unlinkat() does not imply sync, so it's not necessary for btrfs to write the metadata change unless something else has issued fsync on the enclosing directory, maybe. In that case the command would hang until all the dirty metadata resulting from the delete is updated. And btrfsslower will show this.
However, I let a rebalance run overnight.
It shouldn't be necessary to run balance. If you've hit ENOSPC, it's a bug and needs to be reported. And a separate thread can be started on balance if folks want more info on balance, maintenance, ENOSPC things. I don't ever worry about them anymore. Not since ticketed ENOSPC infrastructure landed circa 2016 in kernel ~4.8.
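If you do suspect space pressure, checking allocation is quick; plenty of unallocated space generally means balance isn't needed for ENOSPC reasons:

$ sudo btrfs filesystem usage /home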
On Wed, Mar 24, 2021 at 11:05 PM Chris Murphy lists@colorremedies.com wrote:
On Wed, Mar 24, 2021 at 6:09 AM Richard Shaw hobbes1069@gmail.com wrote:
I was syncing a 100GB blockchain, which means it was frequently getting appended to, so COW was really killing my I/O (iowait > 50%), but I had hoped that marking it nodatacow would be a 100% fix; however, iowait would be quite low but jump up on a regular basis to 25%-50%, occasionally locking up the GUI briefly. It was worst when the blockchain was syncing and I was rm'ing the old COW version, even after rm returned. I assume there were quite a few background tasks still updating.

I assume that for a blockchain it starts small and is just grown / appended to.
Append writes are the same on overwriting and cow file systems. You might get slightly higher iowait because datacow means datasum which means more metadata to write. But that's it. There's no data to COW if it's just appending to a file. And metadata writes are always COW.
Hmm... While still annoying (chrome locking up because it can't read/write to its cache in my /home), my desk chair benchmarking says that it was definitely better as nodatacow. Now that I think about it, during the initial sync I'm likely getting the blocks out of order, which would explain things a bit more. I'm not too worried about nodatasum for this file, as the nature of the blockchain is to be able to detect errors (intentional or accidental) already, and it should be self correcting.
You could install bcc-tools and run btrfsslower with the same (exclusive) workload with datacow and nodatacow to see if latency is meaningfully higher with datacow but I don't expect that this is a factor.
That's an interesting tool. So I don't want to post all of it here as it could have some private info in it but I'd be willing to share it privately.
One interesting output now is the blockchain file is almost constantly getting written to but since it's synced, it's only getting appended to (my guess) and I'm not noticing any "chair benchmark" issues but one of the writes did take 1.8s while most of them were a few hundred ms or less.
iowait just means the CPU is idle waiting for IO to complete. It could do other things, even IO, if that IO can be preempted by proper scheduling. So the GUI freezes are probably because there's some other file on /home, along with this 100G file, that needs to be accessed and between the kernel scheduler, the file system, the IO scheduler, and the drive, it's just reluctant to go do that IO. Again, bcc-tools can help here in the form of fileslower, which will show latency spikes regardless of the file system (it's at the VFS layer and thus closer to the application layer which is where the GUI stalls will happen).
I'm pretty sure that's exactly what's happening. But is there a better I/O scheduler for traditional hard disks? Currently I have:
$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
If this workload can be described in sufficient detail that anyone can reproduce the setup, that makes it possible for multiple other people to collect the information we'd need to track down what's going on. And that includes A/B testing, such as the exact same setup but merely running the 100G sync (presumably it's not the exact size that matters, but the workload while the sync is happening).
I was rounding slightly, so no, not exactly 100GB, but as is the nature of a blockchain, it keeps growing:
$ ls -sh data.mdb
101G data.mdb
A large bittorrent download should also be similar since you don't get the parts in order, but perhaps it's smart enough to allocate all the space on the front end?
Thanks, Richard
On Thu, Mar 25, 2021 at 6:39 AM Richard Shaw hobbes1069@gmail.com wrote:
On Wed, Mar 24, 2021 at 11:05 PM Chris Murphy lists@colorremedies.com wrote:
Append writes are the same on overwriting and cow file systems. You might get slightly higher iowait because datacow means datasum which means more metadata to write. But that's it. There's no data to COW if it's just appending to a file. And metadata writes are always COW.
Hmm... While still annoying (chrome locking up because it can't read/write to its cache in my /home), my desk chair benchmarking says that it was definitely better as nodatacow. Now that I think about it, during the initial sync I'm likely getting the blocks out of order, which would explain things a bit more. I'm not too worried about nodatasum for this file, as the nature of the blockchain is to be able to detect errors (intentional or accidental) already, and it should be self correcting.
Is this information in a database? What kind? There are cow friendly databases (e.g. rocksdb, sqlite with WAL enabled), and comparatively cow unfriendly ones - so it may be that setting the file to nodatacow helps. If there's also multiple sources of frequent syncing, this can also exacerbate things.
You can attach strace to both processes and see if either or both of them are doing sync() of any kind, and what interval. I'm not certain whether bcc-tools biosnoop shows all kinds of sync. It'd probably be useful to know both what ioctl is being used by the two programs (chrome and whatever is writing to the large file), as well as their concurrent effect on bios using biosnoop.
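A minimal strace along those lines, watching only the sync-family calls of an already-running process (<PID> is whatever process you're curious about; -T appends how long each call took):

$ sudo strace -f -p <PID> -T -e trace=fsync,fdatasync,sync,syncfs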
You could install bcc-tools and run btrfsslower with the same (exclusive) workload with datacow and nodatacow to see if latency is meaningfully higher with datacow but I don't expect that this is a factor.
That's an interesting tool. So I don't want to post all of it here as it could have some private info in it but I'd be willing to share it privately.
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
There's no content displayed in any case.
One interesting output now is the blockchain file is almost constantly getting written to but since it's synced, it's only getting appended to (my guess) and I'm not noticing any "chair benchmark" issues but one of the writes did take 1.8s while most of them were a few hundred ms or less.
1.8s is quite a lot of latency. It could be the result of a flush delay due to a lot of dirty data, and while that flush is happening it's not going to be easily or quickly preempted by some other process demanding its data be written right now. Btrfs is quite adept at taking multiple write streams from many processes and merging the writes into sequential writes. Even when the writes are random, Btrfs tends to make them sequential. This is thwarted by sync() which is a demand to write a specific file's outstanding data and metadata right now. It sets up all kinds of seek behavior as the data must be written, then the metadata, then the super block.
I'm pretty sure that's exactly what's happening. But is there a better I/O scheduler for traditional hard disks? Currently I have:

$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
I don't know anything about the workload still so I'm only able to speculate. Bfq is biased toward reads and is targeted at the desktop use case. mq-deadline is biased toward writes and is targeted at server use case. This is perhaps more server-like in that the chrome writes, like firefox, are sqlite databases. Firefox enables WAL, but I don't see that Chrome does (not sure).
You could try mq-deadline.
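Switching is non-destructive, takes effect immediately, and reverts at reboot:

$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler

A udev rule is the usual way to make it stick, if it helps.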
$ ls -sh data.mdb
101G data.mdb
A large bittorrent download should also be similar since you don't get the parts in order, but perhaps it's smart enough to allocate all the space on the front end?
That's up to the application that owns the file and is writing to it. There's going to be a seek hit no matter what because they're both written and read out of order. And while they might be database files, they aren't active databases - different write pattern.