On Tue, 20 Mar 2012 21:38:13 -0500 Dennis Gilmore dennis@ausil.us wrote:
...snip...
We would probably be adding 100-300 systems. We need to consider not only overloading puppet, but also logging and monitoring. I guess it's more a question of how we scale our infrastructure from roughly 100 nodes today to 3 to 4 times that.
Yeah.
...snip...
I'm OK with that; I'm pretty sure fas will scale to the extra boxes. Do we drop monitoring of the builders? What about collectd etc.?
There are a few things we could do about fas load:
a) Add more fas servers.
b) Reduce the number of runs. How often do we change someone in sysadmin-noc, sysadmin-main, sysadmin-build?
c) Move to a system where we only re-run fasClient when there is a change.
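To make option c) a bit more concrete, here is a minimal sketch of the change-detection half only. Everything in it is an assumption for illustration: the function names, the group-to-members mapping, and the idea of storing a digest from the last run are hypothetical, and nothing here calls fas or fasClient itself.

```python
import hashlib
import json


def membership_digest(groups):
    """Return a stable SHA-256 digest of a {group: [members]} mapping."""
    # Sort keys and members so ordering differences do not look like changes.
    canonical = json.dumps(
        {name: sorted(members) for name, members in groups.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_sync(groups, last_digest):
    """True when membership differs from what the last sync recorded.

    last_digest is the value saved after the previous run (hypothetical
    storage); None means no run has been recorded yet.
    """
    return membership_digest(groups) != last_digest
```

A wrapper would call needs_sync() with the current membership and only kick off the real sync when it returns True, then save the new digest.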
I'd agree on turning collectd off, probably. Or at least use a separate one if we needed to monitor them.
The main issue is that today we are not 100% sure how we will install arm boxes. How do we deal with all the non-puppet-related systems? We also need to look into how we can better scale koji itself: when we go from 20 to 200+ builders, we need to make sure that load doesn't cause koji to fall over.
Yeah.
All the arm boxes will have management consoles, but today I'm not 100% sure how access to them would work. We would also need to deploy Fedora for any arm-based systems.
We also need to reconsider networking. Today the storage network and the builder networks are /24s, so we could use 253 nodes. I suspect we will go over that on the build network. We could leave the storage network off the arm builders; it is really only needed for createrepo. But we may need to look at expanding kojipkgs to more nodes, or increasing its network throughput with multiple bonded gigabit network ports.
Think of a mass rebuild with 100 or 200 buildroots initialising at once. It will stress our resources on all levels. But the flexibility of so many nodes could allow us to deploy solid solutions to scale, and show that Fedora is still the leader in open infrastructure and sets industry best practices.
Yeah, hopefully we could have another network that's larger than a /24 for the arm builders.
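For a rough feel of the sizing, here is a small sketch using Python's standard ipaddress module. The 10.0.0.0 base address is just a placeholder, and reserving one address (say, for a gateway) is my assumption to line up with the 253 figure quoted above.

```python
import ipaddress


def usable_hosts(prefix_len, reserved=1):
    """Hosts available on an IPv4 network with the given prefix length.

    Subtracts the network and broadcast addresses, plus `reserved`
    addresses (assumed: one for a gateway).
    """
    # 10.0.0.0 is a placeholder base address, not our real network.
    net = ipaddress.ip_network(f"10.0.0.0/{prefix_len}")
    return net.num_addresses - 2 - reserved


if __name__ == "__main__":
    for prefix in (24, 23, 22):
        print(f"/{prefix}: {usable_hosts(prefix)} usable hosts")
```

Under those assumptions a /23 gives 509 hosts and a /22 gives 1021, so either would cover 200-300 arm builders with room to spare.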
I'm sure some of this will be a process of 'oh no, what we have now doesn't scale, let's fix it'. Of course, some of it we can get ready for up front too.
Overall I like the idea of the automated builder re-install and think it will get us more ready for things like a large arm cluster.
kevin