Greetings all.
This email covers days 3 and 4, since by the time I was going to send yesterday's update it was late and mailman was still down anyhow. :)
So, yesterday started out seeming like a pretty simple day, but didn't turn out that way. We planned to move only two things and otherwise work on fixing issues from the buildsystem and the other moves of the first two days.
* datagrepper / datanommer. This took until this morning as the database is really gigantic. Again, we wanted to load it into a more modern postgres. Now that it's moved and on postgres 12.2, we will be looking into partitioning the data (perhaps by month? quarter?) so queries for anything recent are much faster.
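For the curious, a rough sketch of what monthly range partitioning could look like on postgres 12. The table and column names here are purely illustrative, not the actual datanommer schema:

```sql
-- Hypothetical sketch only; datanommer's real schema differs.
CREATE TABLE messages (
    id        bigint NOT NULL,
    sent_at   timestamptz NOT NULL,
    body      jsonb
) PARTITION BY RANGE (sent_at);

-- One partition per month; old months can be detached or archived later.
CREATE TABLE messages_2020_06 PARTITION OF messages
    FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');
CREATE TABLE messages_2020_07 PARTITION OF messages
    FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');

-- A query constrained to a recent window then only scans the matching
-- partition(s) instead of the whole (gigantic) table:
-- SELECT * FROM messages WHERE sent_at >= now() - interval '7 days';
```

The win is that the planner prunes partitions outside the WHERE clause's time range, so "anything recent" queries touch a small fraction of the data.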
* mailman / lists: This turned out to be our biggest problem of the move. :( We are working on getting this install moved over to a recent fedora or rhel, but for now it's rhel7 and python34. Because of that, we decided to just copy the entire instance over and adjust it, rather than doing a fresh install. The copy ran most of the day and was nearing completion, but then we accidentally resized the original instance. :( We resized it back, but the filesystem was messed up and the instance would no longer boot. At that point we decided that lack of sleep can lead to poor decisions and mistakes, so we started copying the data from that partial copy to another freshly installed instance and went and got some sleep.
The next day, in a stroke of luck, it turned out the copy had already transferred all of the disk that actually had data on it, so we were able to fsck it and resize it and we were back in business. mailman/lists was back up this morning and happily processing away.
Today, in addition to finishing the above two migrations from yesterday, we moved:
* openqa. Right now it doesn't have any arm or power workers, but we have some almost ready to go there that we should have in place next week.
* Various openshift apps (docsbuilding, websites building, cron jobs, etc). We even have release-monitoring and the new hotness up and running. I am trying to bring koschei up as well, but it needs some more work.
* Some small misc apps: blockerbugs, kerneltest, etc.
* We also fixed tons and tons of issues all over the map. Mostly around things reaching other things or something not running for some configuration reason.
At this point everything we planned to have in the minimal fedora footprint should be up and working. We do have more capacity than we need, so if things go smoothly without too many more things to fix, I'd like to see about bringing up badges as well, since it's a popular app and should be easy to do with the capacity we have.
Tomorrow and this weekend we are going to work on taking things down in the old datacenter and get them ready for shipping next week. They will be in transit next week, then we hopefully can get them racked and built and start adding capacity back the week after.
So, if you notice something not working now, please do look to see if there's already a ticket on it, and if not please file one. ( https://pagure.io/fedora-infrastructure/issues ).
Overall things went pretty well from my view, and I would really like to thank the awesome fedora community for being patient with us. I was pretty surprised how few people asked why things were down, and when they did, other community members were quick to tell them.
kevin
On Thu, 2020-06-11 at 21:09 -0700, Kevin Fenzi wrote:
> - openqa. Right now it doesn't have any arm or power workers, but we
> have some almost ready to go there that we should have in place next week.
Quick clarification on this: up till now, we've always had aarch64 and ppc64 tests running only on the staging instance; prod has always been x86_64 only. I've been sort of planning to move at least aarch64 to production for a while now, but I've been a bit reluctant because we still seem to hit a lot of flakes there (and the extremely long-standing https://bugzilla.redhat.com/show_bug.cgi?id=1689037 is still a big problem in aarch64 tests, and I still can't get any headway on getting that fixed).
Right now it seems we won't have a staging instance (we may rename the second instance to 'test' or something, by the by, because 'staging' is a bad name for it as it's not really like other 'staging' things in infra) for a bit because we're running on a fairly minimal set of hardware in the new DC for now.
So the upshot is I'm kinda still unsure about how to proceed with that part of things. I'll talk to Kevin and Smooge tomorrow about exactly when we're likely to get a) the second server instance and b) the other worker hosts, and try to come up with a plan.
So for now we just have the new prod instance operational, and it's only running x86_64 tests (on a single worker host for now). It's working more or less fully, but right at this minute quite a lot of tests are failing due to a 'russian roulette' sort of problem: there are four possible IPs that mirrors.fedoraproject.org can resolve to from inside the new DC, and it seems that from the new worker host, two of them work and two of them don't.
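For anyone wanting to check this kind of thing from a host themselves, here's a minimal diagnostic sketch: resolve all the A records for a name and try a plain TCP connect to each one. This is just an illustration of the general approach, not the tooling we're actually using:

```python
import socket

def resolve_all(host, port=443):
    """Return the set of distinct IP addresses a hostname resolves to."""
    return {info[4][0]
            for info in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)}

def probe(ip, port=443, timeout=5):
    """Try a plain TCP connect to one resolved address; True if reachable."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # With round-robin DNS, some resolved addresses can be broken while
    # others work, so probe each address individually.
    host = "mirrors.fedoraproject.org"
    for ip in sorted(resolve_all(host)):
        print(ip, "ok" if probe(ip) else "UNREACHABLE")
```

Since clients pick from the resolved addresses more or less at random, a mix of "ok" and "UNREACHABLE" lines here matches the roughly 50/50 failure pattern we're seeing.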
Thanks to nirik and smooge for all their hard work, and sorry for the spurious test failures; I'll try to get that sorted ASAP.