On 12/26/2011 03:57 AM, Brendan Conoboy wrote:
On 12/25/2011 03:47 AM, Gordan Bobic wrote:
> On 12/25/2011 06:16 AM, Brendan Conoboy wrote:
>> Allocating builders to individual rather than a single raid volume will
>> help dramatically.
> Care to explain why?
Sure, see below.
> Is this a "proper" SAN or just another Linux box with some disks in it?
> Is NFS backed by a SAN "volume"?
As I understand it, the server is a Linux host using raid0 with 512k
chunks across 4 sata drives. This md device is then formatted with some
filesystem (ext4?). Directories on this filesystem are then exported to
individual builders such that each builder has its own private space.
These private directories contain a large file that is used as a
loopback ext4fs (IE, the builder mounts the nfs share, then loopback
mounts the file on that nfs share as an ext4fs). This is where
/var/lib/mock comes from. Just to be clear, if you looked at the
nfs-mounted directory on a build host you would see a single large file
that represented a filesystem, making traditional ext?fs tuning a bit
more complicated.
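To make the layout concrete, each builder does something along these lines (server name, export path and image filename are all made up for illustration):

```shell
# Sketch of the described layout; names and paths are hypothetical.
# 1. Mount the per-builder private NFS export.
mount -t nfs nfsserver:/export/builder01 /mnt/builder01
# 2. Loopback-mount the large image file on that share as ext4.
mount -o loop -t ext4 /mnt/builder01/mock.img /var/lib/mock
```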
Why not just mount direct via NFS? It'd be a lot quicker, not to mention
easier to tune. It'd work for building all but a handful of packages
(e.g. zsh), but you could handle that by having a single builder that
uses a normal fs that has a policy pointing the packages that fail
self-tests on NFS at it.
The structural complication is that we have something like 30-40 builders
all vying for the attention of those 4 spindles. It's really important
that each builder not cause more than one disk to perform an operation
because seeks are costly, and if just 2 disks get called up by a single
builder, 50% of the storage resources will be taken up by a single host
until the operation completes. With 40 hosts, you'll just end up
thrashing (with considerably fewer hosts, too). Raid0 gives great
throughput, but it's at the cost of latency. With so many 100mbit
builders, throughput is less important and latency is key.
512KB chunks sound vastly oversized for this sort of a workload. But if
you are running ext4 on top of loopback file on top of NFS, no wonder
the performance sucks.
Roughly put, the two goals for good performance in this scenario are:
1. Make sure each builder only activates one disk per operation.
Sounds like a better way to ensure that would be to re-architect the
storage solution more sensibly. If you really want to use block level
storage, use iSCSI on top of raw partitions. Providing those partitions
are suitably aligned (e.g. for 4KB physical sector disks, erase block
sizes, underlying RAID, etc.), your FS on top of those iSCSI exports
will also end up being properly aligned, and the stride, stripe-width
and block group size will all still line up properly.
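To illustrate how those numbers line up, the ext4 alignment parameters fall straight out of the chunk size and disk count. The 64KB chunk here is just an assumed example geometry, not the array's actual settings:

```shell
# Assumed example geometry: 4-disk RAID0, 64KB chunk, 4KB fs blocks.
chunk_kb=64; nr_disks=4; block_kb=4
stride=$((chunk_kb / block_kb))        # fs blocks per chunk
stripe_width=$((stride * nr_disks))    # fs blocks per full stripe
echo "stride=$stride stripe_width=$stripe_width"
# Then something like:
#   mkfs.ext4 -b 4096 -E stride=$stride,stripe_width=$stripe_width /dev/sdX1
```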
But with 40 builders, each builder only hammering one disk, you'll still
get 10 builders hammering each spindle and causing a purely random seek
pattern. I'd be shocked if you see any measurable improvement from just
splitting up the RAID.
2. Make sure each io operation causes the minimum amount of seeking.
You're right that good alignment and block sizes and whatnot will help
this cause, but there is still greater likelihood of io operations
traversing spindle boundaries periodically even in the best situation. You'd
need a chunk size about equal to the fs image file size to pull that off.
Using the fs image over loopback over NFS sounds so eyewateringly wrong
that I'm just going to give up on this thread if that part is immutable.
I don't think the problem is significantly fixable if that approach remains.
Perhaps an lvm setup with strictly defined layouts with each
lvcreate would make it a bit more manageable, but for simplicity's sake
I advocate simply treating the 4 disks like 4 disks, exported according
to expected usage patterns.
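A rough sketch of what strictly defined layouts would look like, pinning each LV to a single PV so one builder only ever touches one spindle (volume and device names are all made up):

```shell
# Hypothetical names throughout; one PV per physical disk, no striping.
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
vgcreate vg_builders /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# Pin builder01's LV to /dev/sdb1 by naming the PV explicitly,
# so its io never spills onto another spindle.
lvcreate -n builder01 -L 8G vg_builders /dev/sdb1
```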
I don't see why you think that seeking within a single disk is any less
problematic than seeking across multiple disks. That will only happen
when the file exceeds the chunk size, and that will typically happen
only at the end when linking - there aren't many cases where a single
code file is bigger than a sensible chunk size (and in a 4-disk RAID0
case, you're pretty much forced to use a 32KB chunk size if you intend
for the block group beginnings to be distributed across spindles).
In the end, if all this is done and the builders are delayed by deep
sleeping nfsds, the only options are to move /var/lib/mock to local
storage or increase the number of spindles on the server.
And local storage will be what? SD cards? There's only one model line of
SD cards I have seen to date that actually produces random-write results
that begin to approach a ~5000 rpm disk (up to 100 IOPS), and those are
SLC and quite expensive. Having spent the last few months patching,
fixing up and rebuilding RHEL6 packages for ARM, I have a pretty good
understanding of what works for backing storage and what doesn't - and
SD cards are not an approach to take if performance is an issue. Even
expensive, highly branded Class 10 SD cards only manage ~ 20 IOPS
(80KB/s) on random writes.
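The arithmetic behind that figure, assuming 4KB random writes:

```shell
# ~80KB/s of 4KB random writes works out to ~20 IOPS.
write_kb_per_s=80; io_size_kb=4
echo "$((write_kb_per_s / io_size_kb)) IOPS"
```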
>> Disable fs
>> journaling (normally dangerous, but this is throw-away space).
> Not really dangerous - the only danger is that you might have to wait
> for fsck to do its thing on an unclean shutdown (which can take hours
> on a full TB scale disk, granted).
I mean dangerous in the sense that if the server goes down, there might
be data loss, but the builders using the space won't know that. This is
particularly true if nfs exports are async.
Strictly speaking, the journal is about preserving the integrity of the FS
so you don't have to fsck it after an unclean shutdown, not about
preventing data loss as such. But I guess you could argue the two are
related.
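For reference, dropping the journal on a throw-away filesystem looks something like this (device name is hypothetical, and the fs must be unmounted first):

```shell
# Hypothetical device; only sane because the data is throw-away.
umount /dev/vg_builders/builder01
tune2fs -O ^has_journal /dev/vg_builders/builder01
# A full fsck is required after removing the journal.
e2fsck -f /dev/vg_builders/builder01
```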
> Build of zsh will break on NFS whatever you do. It will also break on a
> local FS with noatime. There may be other packages that suffer from this
> issue but I don't recall them off the top of my head. Anyway, that is an
> issue for a build policy - have one builder using block level storage
> with atime and the rest on NFS.
Since loopback files representing filesystems are being used with nfs as
the storage mechanism, this would probably be a non-issue. You just
can't have the builder mount its loopback fs noatime (hadn't thought of
that).
I'm still not sure what is the point of using a loopback-ed file for
storage instead of raw NFS. NFS mounted with nolock,noatime,proto=udp
works exceedingly well for me with NFSv3.
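Concretely, that's a mount along these lines (server name and export path are placeholders):

```shell
# NFSv3 over UDP with locking and atime updates off; names are placeholders.
mount -t nfs -o vers=3,nolock,noatime,proto=udp \
    nfsserver:/export/builder01 /var/lib/mock
# Equivalent /etc/fstab entry:
# nfsserver:/export/builder01  /var/lib/mock  nfs  vers=3,nolock,noatime,proto=udp  0 0
```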
>> Once all that is done, tweak the number of nfsds such that
>> there are as many as possible without most of them going into deep
>> sleep. Perhaps somebody else can suggest some optimal sysctl and ext4fs
>> settings.
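A sketch of what checking for deep-sleeping nfsds looks like (RHEL-style paths assumed):

```shell
# The 'th' line in /proc/net/rpc/nfsd shows the thread count and how
# often all threads were busy at once; rising busy counters mean you
# need more threads.
grep ^th /proc/net/rpc/nfsd
# Raise the thread count (RHEL-style config path assumed):
echo 'RPCNFSDCOUNT=32' >> /etc/sysconfig/nfs
service nfs restart
```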
> As mentioned in a previous post, have a look here:
> Deadline scheduler might also help on the NAS/SAN end, plus all the
> usual tweaks (e.g. make sure write caches on the disks are enabled, if
> the disks support write-read-verify disable it, etc.)
Definitely worth testing. Well ordered IO is critical here.
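A quick sketch of those tweaks on the server side (device names are assumptions):

```shell
# Device names assumed; apply per physical disk on the NFS server.
for d in sda sdb sdc sdd; do
    echo deadline > /sys/block/$d/queue/scheduler
    hdparm -W1 /dev/$d    # make sure the drive's write cache is on
done
```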
Well, deadline is about favouring reads over writes. Writes you can
buffer as long as you have RAM to spare (especially with libeatmydata
LD_PRELOAD-ed). Reads, however, block everything until they complete. So
favouring reads over writes may well get you ahead in terms of keeping
the builders busy.
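For what it's worth, libeatmydata is just an LD_PRELOAD wrapper that turns fsync() and friends into no-ops. The library path and the mock config name below are assumptions, not anything from this setup:

```shell
# Library path and mock config name are hypothetical; only safe on
# throw-away build space, since fsync() becomes a no-op.
LD_PRELOAD=/usr/lib/libeatmydata.so mock -r epel-6-arm --rebuild foo.src.rpm
```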