The discussion on the devel list about ARM, and my work last week on
reinstalling builders quickly and repeatedly, has raised a number of
issues with how we manage our builders and how we should manage them
going forward.
It is apparent that if we add arm builders, they will be lots of
physical systems (probably in a very small space), but physical
nonetheless. So we need a sensible way to manage and reinstall these
hosts quickly and often.
Additionally, we need to consider what the introduction of a largish
number of arm builders (and other arm infrastructure) would do to our
existing puppet setup. Specifically, it would overload it pretty badly
and make it not very manageable.
I'm making certain assumptions here and I'd like to be clear about what
they are:
1. the builders need to be kept pristine
2. that currently our builders are not freshly installed frequently
3. that the builders are relatively static in their
configuration and most changes are done with pkg additions
4. that builder setups require at least two manual-ish steps by a koji
admin who can disable/enable/register the builder with the kojihub.
5. that the builders are fairly different, networking- and setup-wise,
from the rest of our systems.
So I am proposing that we consider the following as a general process
for maintaining our builders:
1. disable the builder in koji
2. make sure all jobs are finished
3. add installer entries into grub (or run the undefine, reinstall
process if the builder is virt-based)
4. reinstall the system
5. monitor for ssh to return
6. connect in and force our post-install configuration: identification,
network, mount-point setup, ssl certs/keys for koji, etc
7. re-enable the host in koji
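A sketch of what automating those steps could look like, as a dry-run
shell function. Only the `koji disable-host`/`koji enable-host`
subcommands are real commands; the other step names are placeholders
for tooling we would still need to write:

```shell
# Dry-run sketch of the reinstall cycle for one builder.
# Only the koji subcommands are real; the rest are placeholder
# names for scripts we would have to write.
reinstall_builder() {
    builder="$1"
    echo "koji disable-host $builder"     # 1. disable the builder in koji
    echo "wait-for-idle $builder"         # 2. make sure all jobs are finished
    echo "grub-add-installer $builder"    # 3. add installer entry (or undefine/reinstall if virt)
    echo "reboot-to-installer $builder"   # 4. reinstall the system
    echo "wait-for-ssh $builder"          # 5. monitor for ssh to return
    echo "post-install-config $builder"   # 6. identity, network, mounts, koji ssl certs
    echo "koji enable-host $builder"      # 7. re-enable host in koji
}
```

In a real version each echo would become the actual command, with the
koji steps run from a host that has admin credentials.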
We would do this frequently and regularly, perhaps even having some
percentage of our builders doing it at all times. I.e., with 1/10th of
the boxes reinstalling at any given moment, the whole set gets
reinstalled every ten reinstall cycles.
Additionally, this would mean these systems would NOT have a puppet
management piece at all. Package updates would still be handled
by pushes as we do now, if things were security critical, but barring
the need for significant changes we could rely on the boxes simply being
refreshed frequently enough that it wouldn't need to be pushed.
What do folks think about this idea? It would dramatically reduce the
node entries in our puppet config, it would drop the number of hosts
connecting to puppet, too. It will mean more systems being reinstalled
and more often. It will also require some work to make the steps I
mention above be automated. I think I can achieve that without too much
difficulty, actually. I think, in general, it will increase our ability
to scale up to more and more builders.
I'd like your input; constructive input, please.
Just had a talk with tflink on IRC about the management of the qa
network machines. Long ago when we setup those machines we were
thinking we could use them as a testbed for bcfg2 to see if we wanted
to start using it or if it worked ok, etc. I setup a bcfg2 server to
try this with, but sadly have never found the time to even start on it.
The machines involved include:
- virthost-comm01.qa (real hardware)
(someday we may add a sign-bridge-comm01 and sign-vault-comm01 to allow
secondary archs like ppc and arm to sign packages).
- Try and push forward with a bcfg2 setup on lockbox-comm01.qa and
evaluate it. This would be nice, but I'm really not sure anyone has
the time to do it.
- Just add all the above machines to our puppet repo and configure them
there and call it done. This would mean they wouldn't be separate
from us, and we would just update, configure, and monitor them like any
other hosts.
- Try and work out some setup with ansible or the like to see if it
could manage them. Again, this would be a learning and tweaking
curve, so not sure we have the time.
- We could setup a new puppet for them on lockbox-comm01.qa and use
that to manage them. We could reuse a lot of our current puppet
setup, but it would still be a fair bit of work to get it all going.
Thoughts? Brilliant ideas?
Here is the list of needed package reviews for fedmsg.
These are already done:
These two are not package reviews, but are tickets that need to be
resolved in order to move forward with fedmsg in stg:
All of the above are dependencies of the latest major version bump of
Moksha. I haven't yet submitted the review request for python-fedmsg itself,
but it's coming soon.
The GSoC 2012 selection process is over! We had high demand from
students, and since we are offering only a limited number of slots, we
couldn't accept all of the good students.
Therefore we are planning to launch a program for returning students
(those who weren't selected for GSoC with Fedora).
The structure of the program is not yet finalized, but we will
certainly need some tasks for students to work on. If you are
interested in adding a task to the list, please feel free.
Please note the number of hours needed to complete the task, and the
contact details of the person to reach for more information.
Nothing is finalized yet, so please join us if you are interested in
helping shape the program.
Buddhike Chandradeepa Kurera(bckurera)
Fedora Ambassador - APAC region
Event Liaison - Design Team
Email: email@example.com | IRC: bckurera
My name is Keith McGrellis and I live in Belfast, Northern Ireland.
I've worked with Linux both at work and at home for about 15 years now.
I've used a mixture of distributions, including Red Hat, Debian,
Ubuntu and Fedora.
My main background is sys admin and scripting (mainly bash and perl).
I've experience with:
and other general Linux admin.
I don't know python but am willing and would like to learn.
I would like to be able to help out in whatever way I can.
My IRC nick is kmcgrell.
As most/all people know, we attempted to migrate hosted to a gluster
backend across two systems on Wednesday evening. Thursday we awoke to
a host of problems and set about solving them. Thursday evening we
migrated back to our previous configuration.
Thanks for your patience on Thursday, everyone.
Below is an explanation of what happened:
The hosted migration started on Wednesday afternoon.
The plan was to move to glusterfs from a single node/drbd failover:
hosted01 and hosted02 would become 'hosted' - both serving files
from /srv (our glusterfs share)
both systems were clients and servers (in the gluster sense):
- both systems exporting a brick of the same replica.
- both systems mounting that replicated share.
when mounting with fuse we started seeing pretty serious performance
issues to the point that users were complaining it was not working. It
would take 20-30s to render a single ticket from trac.
We switched to nfs mounts and performance improved, but we saw an
enormous number of db locking issues on the servers.
At this point we contacted the gluster upstream developers who were
outrageously helpful in tracking down the problems.
After some research it was determined that:
- gluster 3.2 over nfs doesn't support any remote locking at all
- if we brought things down to 1 node with local_lock=all, then things
would work and perform 'ok', but we could not access the share from
the other node
- this meant we could replicate the fs but not use it from both hosts
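For reference, the two kinds of client mounts we were comparing look
roughly like this. This is a sketch only: the hostname argument and
the "hosted" volume name are assumptions, while local_lock is the
standard nfs mount option discussed above:

```shell
# Emit the mount commands for the two client types we tested.
# Hostname argument and the "hosted" volume name are assumptions;
# local_lock=all keeps all nfs locking client-local.
fuse_mount_cmd() {
    echo "mount -t glusterfs $1:/hosted /srv"
}
nfs_mount_cmd() {
    # vers=3 because gluster's built-in nfs server speaks NFSv3
    echo "mount -t nfs -o vers=3,local_lock=all $1:/hosted /srv"
}
```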
After moving to gluster over nfs we ran into a new problem:
gluster's nfs server does not support --manage-gids, so we were
restricted to 16 gids per user. There is no solution outside of new
code for this one; investigation into writing that code for gluster is
underway.
jdarcy and pranithk were given sysadmin-hosted access to look at logs
directly on hosted01/02 and follow up on the split-brain reports we
were seeing.
jdarcy and pranithk tracked the self-heal/split-brain problems back to
dirs with out-of-sync fattrs. The only way to solve this was to
manually remove the out-of-sync fattrs, after verifying that ONLY the
fattrs were out of sync and not any data.
this involved looking at all dirs with self-heal problems and running:
> setfattr -x trusted.afr.hosted-client-0 /glusterfs/hosted/$dir
> setfattr -x trusted.afr.hosted-client-1 /glusterfs/hosted/$dir
to clear those settings, then re-accessing the dir through the /srv
mount to force the self-heal to complete correctly.
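With many affected dirs, the two setfattr commands above can be swept
over a list. A dry-run sketch, where the list file of affected dirs is
hypothetical and each printed line would only be run for real after
verifying that only the fattrs differ:

```shell
# Print (dry-run) the commands to clear out-of-sync afr fattrs for
# each dir listed in a file (one affected dir per line, hypothetical),
# then re-access each dir through the /srv client mount to trigger
# the self-heal.
heal_dirs() {
    while read -r dir; do
        echo "setfattr -x trusted.afr.hosted-client-0 /glusterfs/hosted/$dir"
        echo "setfattr -x trusted.afr.hosted-client-1 /glusterfs/hosted/$dir"
        echo "ls /srv/$dir"    # re-access via the mount to force self-heal
    done < "$1"
}
```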
At this point we did not appear to be having self-heal issues, but we
still had group ids limited to 16 under the nfs clients.
The only option to resolve that is to patch the gluster nfs server to
do the equivalent of --manage-gids.
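A quick way to check which users would trip that limit (nothing
gluster-specific here; `id -G` just lists a user's supplementary
groups):

```shell
# Count a user's group memberships. With gluster's nfs server
# lacking a --manage-gids equivalent, membership past the first
# 16 groups is effectively lost on the wire.
gid_count() {
    id -G "$1" | wc -w
}

# True (exit 0) if the user would hit the 16 gid nfs limit.
over_nfs_gid_limit() {
    [ "$(gid_count "$1")" -gt 16 ]
}
```

e.g. `over_nfs_gid_limit someuser && echo affected`.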
We attempted to see if we could optimize the fuse mounts to work around
the nfs limitations. We set up the fuse mount on hosted02 and ran
performance tests; the results were 'okay' but not really acceptable.
Additionally, after testing the fuse enhancements we were informed that
fuse suffers from the same 16 gid limitation that nfs does. So we were
completely dead in the water.
We punted back to hosted03, re-rsyncing everything back.
We also setup a new host: hosted-list01.fedoraproject.org at internetx.
This will allow us to move the hosted mailing lists OFF of
fedorahosted.org which gains us a lot of latitude in how we move around
projects that we did not have before.
We will start on the gluster migration + testing if/when we get a patch
for 3.3 from jdarcy to handle > 16gids via nfs.
If that occurs, these are the tests to run once we have 3.3 and the
> 16 gid patch in place:
1. that nfs locking actually works (test with local_lock=none) and a
sqlite3 .dump completes:
> rm -f /srv/trac/projects/fedora-infrastructure/db/fixed.db
> sqlite3 /srv/trac/projects/fedora-infrastructure/db/trac.db .dump |
>   sqlite3 /srv/trac/projects/fedora-infrastructure/db/fixed.db
2. that writes with a gid beyond 16 work
3. that performance is palatable: cloning git repos
4. test trac with both systems
5. look for self-healing issues
6. failover testing: kill one node and confirm the other keeps working,
even if in a limited fashion
Things to do before production of gluster:
- MOVE GITWEB CACHING OFF OF /srv
Many thanks to the gluster dev team for helping us track down where the
problems were coming from and for attempting to help us fix them.
Their help was indispensable.
The infrastructure team will be having its weekly meeting tomorrow,
2012-04-26, at 18:00 UTC in #fedora-meeting on the freenode network.
#topic New folks introductions and Apprentice tasks.
If any new folks want to give a quick one line bio or any apprentices
would like to ask general questions, they can do so here.
#topic two factor auth status
#topic Staging re-work status
#topic Applications status / discussion
Check in on status of our applications: pkgdb, fas, bodhi, koji,
community, voting, tagger, packager, dpsearch, etc.
If there's new releases, bugs we need to work around or things to note.
#topic Meeting time
#topic Upcoming Tasks/Items
#info 2012-04-29 to 2012-05-03 - Kevin out in the wilds of NM
#info 2012-05-01 to 2012-05-15 - F17 Final Freeze.
#info 2012-05-01 - nag fi-apprentices.
#info 2012-05-03 - gitweb-cache removal day.
#info 2012-05-09 - Check if puppet works on f17 yet.
#info 2012-05-10 - drop inactive fi-apprentices
#info 2012-05-15 - F17 release
#topic Meeting tagged tickets:
#topic Open Floor
Submit your agenda items as tickets in the trac instance and send a
note replying to this thread.
More info here:
Information as requested:
I have 8 months of previous experience maintaining 9 RHEL 5.5 Servers
within a small corporate data center. (user administration, rpm updates
/ package management, kernel upgrades, HP PSP updates) I also possess
roughly 10 years of general IT experience (Windows server/client) and
hold various certifications including the Cisco CCNP.
I completed the RHEL Essentials training and most recently took the
LPIC-1 course from CBT Nuggets. I really enjoy supporting Linux systems
and would like to contribute where possible while continuing to learn
general Linux server/desktop administration.
Hello, people. I hope you are ok.
These are my skills:
- Java Programming (Desktop / Web) - Intermediate
- C Programming - Intermediate
- C++ Programming - Intermediate
- C# Programming - Beginner
- Shell Script
- Computer Networks
- Software Engineering
and I want to learn more C, C++, Linux development, and whatever comes
next. =)