Greetings.
Just a heads up that later today (in the early hours of 2020-02-14 UTC) I intend to reinstall the staging RabbitMQ cluster with RHEL 8 and a newer RabbitMQ.
Playbooks will then be run to set up users and queues again.
Abompard will check over everything first thing in his morning tomorrow, so any problems should hopefully not last more than a few hours.
This upgrade should give us some nice new features, and if all looks well we will schedule a time to do production as well.
Just wanted to let everyone know in case you see alerts or missing messages later today.
Thanks for your patience.
kevin
Hey folks,
I thought I'd make a summary of where I'm at. Here are the issues I found and what I did about them:
- We ran into an Ansible issue that the PR https://github.com/ansible/ansible/pull/50381 fixes. I've asked pingou to patch batcave since it's basically a one-liner that will keep working with the older prod version.
- When starting a RabbitMQ cluster from scratch, there is a race condition that is documented here: https://www.rabbitmq.com/cluster-formation.html#initial-formation-race-condi...
  On nodes 02 and 03, I've just destroyed the database and let it auto-detect the cluster again:
    # systemctl stop rabbitmq-server && rm -rf /var/lib/rabbitmq/mnesia/ && systemctl start rabbitmq-server
  It worked fine. I checked with "rabbitmqctl list_users" that all nodes had the same users declared.
- I've also fixed a couple of things in the playbooks that assumed the cluster was already up and set up.
- I've rebuilt collectd-rabbitmq for EPEL8 but we currently only install it on production apparently (not sure why, I think it could be useful in staging).
- The nagios-plugins-rabbitmq RPM still fails to install because of a dependency bug in perl-Monitoring-Plugin. I've opened a ticket about it: https://bugzilla.redhat.com/show_bug.cgi?id=1803121
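For reference, the node reset and the user check from the second item can be sketched roughly as below. This is only a sketch under assumptions: the function names are made up for illustration, the reset is the same one-liner quoted above wrapped in a function, and the parsing assumes that "rabbitmqctl list_users" prints one tab-separated "name, tags" row per user, possibly preceded by a "Listing users ..." banner and a "user tags" header row.

```shell
#!/bin/sh
# Sketch only. Assumes the systemd unit name and data directory quoted in
# this thread; run as root on the node being reset.

# Wipe the local RabbitMQ database so the node rediscovers the cluster on
# its next start. DESTRUCTIVE: all broker state on this node is lost.
reset_rabbitmq_node() {
    systemctl stop rabbitmq-server &&
        rm -rf /var/lib/rabbitmq/mnesia/ &&
        systemctl start rabbitmq-server
}

# Read `rabbitmqctl list_users` output on stdin and print the user names,
# sorted, one per line. The banner and header row are skipped if present.
list_user_names() {
    awk -F '\t' '
        /^Listing / { next }                    # banner line
        $1 == "user" && $2 == "tags" { next }   # header row
        NF { print $1 }                         # one row per user
    ' | sort
}
```

Running "rabbitmqctl list_users | list_user_names" on each node and comparing the results (with diff, for instance) should show whether all nodes agree on the declared users.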
Now, we need to recreate the queues, users and bindings, and I don't have the permissions to run all the playbooks. If someone could run the master playbook limited to staging and to the rabbitmq_cluster tag, I think it should recreate all users and queues and we should be all set.
I'm around and on IRC if you need me.
Aurélien
On Fri, Feb 14, 2020 at 03:54:48PM +0100, Aurelien Bompard wrote:
> - I've rebuilt collectd-rabbitmq for EPEL8 but we currently only install it
> on production apparently (not sure why, I think it could be useful in staging).
I think that was me disabling it in stg because it wasn't working?
> Now, we need to recreate the queues, users and bindings, and I don't have the permissions to run all the playbooks. If someone could run the master playbook limited to staging and to the rabbitmq_cluster tag, I think it should recreate all users and queues and we should be all set.
On it.
> I'm around and on IRC if you need me.
Thanks much for working on this. :)
I guess the next step is to add the stuff that we needed this new version for and confirm it works? Then on to production?
kevin
I hit some permissions problems with the playbook that I can't figure out.
For example, the bodhi playbook fails trying to set up a bodhi queue, saying it doesn't have permissions to see if the queue exists or not.
It's authenticating as admin, which has .* perms as far as I can tell, and I tried adding the 'administrator' and 'maint' tags, but that didn't help. I am not sure what permission it's lacking. ;(
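One way to double-check what a user is actually granted is to parse the output of "rabbitmqctl list_user_permissions". A minimal sketch, assuming that command prints one tab-separated "vhost, configure, write, read" row per virtual host, possibly preceded by a banner and a header row; the function name and the vhost names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: read `rabbitmqctl list_user_permissions USER` output on
# stdin and print every vhost where configure, write and read are all ".*".
full_access_vhosts() {
    awk -F '\t' '
        /^Listing / { next }                          # banner line
        $1 == "vhost" && $2 == "configure" { next }   # header row
        $2 == ".*" && $3 == ".*" && $4 == ".*" { print $1 }
    '
}
```

For example, "rabbitmqctl list_user_permissions admin | full_access_vhosts" would list only the vhosts where admin has unrestricted permissions. Note that this covers permissions only; user tags are a separate setting, shown by "rabbitmqctl list_users".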
kevin
> I hit some permissions problems with the playbook that I can't figure out.
I found why: apparently, when tags (RabbitMQ tags, not Ansible tags) aren't specified with the rabbitmq_user Ansible module, it clears them, while I thought it would leave them alone. I've fixed it, so it should work now. Thanks
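To see whether a playbook run has wiped a user's tags, the tag column of "rabbitmqctl list_users" can be checked before and after the run. A minimal sketch, assuming the output has one tab-separated "name, tags" row per user (with an optional banner and header row); the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: read `rabbitmqctl list_users` output on stdin and print the
# tag list recorded for the user named in $1, e.g. "[administrator]".
# Prints nothing if the user is not listed.
user_tags() {
    awk -F '\t' -v user="$1" '
        /^Listing / { next }                    # banner line
        $1 == "user" && $2 == "tags" { next }   # header row
        $1 == user { print $2 }
    '
}
```

For example, "rabbitmqctl list_users | user_tags admin" printing "[]" after a run would mean the tags were cleared. They can be put back by hand with "rabbitmqctl set_user_tags admin administrator" (which tag the user needs is an assumption here), but the lasting fix is to give the tags explicitly in the playbook task.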
Aurélien
On Sat, Feb 15, 2020 at 11:08:44PM +0100, Aurelien Bompard wrote:
> > I hit some permissions problems with the playbook that I can't figure out.
> I found why: apparently, when tags (RabbitMQ tags, not Ansible tags) aren't specified with the rabbitmq_user Ansible module, it clears them, while I thought it would leave them alone. I've fixed it, so it should work now.
Ah ha! That makes sense... thanks for tracking this down. :)
I ran the playbook and indeed it finished fine.
There may be things that need to be restarted, but otherwise I think it's all working now?
kevin
infrastructure@lists.fedoraproject.org