Hey folks,
The latest version of FMN in production includes a patch[0] that breaks all the rules that query for package watchers, resulting in this[1] infrastructure issue. There's an open PR[2] on FMN that fixes the issue (reviews welcome). To get the fix into production, we have two options.
The first is to backport it to the current version in production (1.5), which should be trivial since nothing in this area has been touched in 2.0. We can then update production and carry on.
The second option is to update production to 2.0 now (I've included [2] as a patch in the RPM currently in stage). 2.0 includes a rewrite of the back-end components of FMN to use Celery. It's running in stage now. Things to note about this:
* The FMN back-end now requires F26 because of its Celery version requirements.
* The FMN front-end is currently still on RHEL7, but I haven't updated it in stage yet, so I don't know whether any adjustments are necessary for that (the front-end doesn't use Celery, so the fact that it's old _shouldn't_ be a problem).
* Some care will need to be taken when switching over the AMQP queues, especially because the current FMN queues are jammed with unformattable messages that FMN keeps requeuing (about 25K of them). We could also just cut our losses and drop these (see the sketch after this list).
* The scripts that monitor queue length will need to be adjusted, since there are more queues now and the existing queues have been renamed.
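If we do decide to cut our losses, dropping them should be quick. Here's a rough, untested sketch (it assumes the broker is RabbitMQ reachable with default credentials on localhost, pika 1.x, and that the old worker queue is still named "workers", so adjust as needed):

import pika

# Hypothetical sketch: purge the old worker queue instead of trying to
# save the ~25K unformattable messages. The host and the queue name
# "workers" are assumptions about the deployment, not confirmed values.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# queue_purge discards every ready message in the queue and returns a
# frame reporting how many were removed.
frame = channel.queue_purge(queue="workers")
print("Dropped %d messages" % frame.method.message_count)

connection.close()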
One thing to note is that we're going to have to go through all of the above at some point anyway. FMN also doesn't really have anything to do with the release process, so if it all goes south during the freeze it shouldn't matter.
I don't have a preference one way or the other, really. Whatever makes the admins happy makes me happy.
[0] https://github.com/fedora-infra/fmn/pull/206
[1] https://pagure.io/fedora-infrastructure/issue/6462
[2] https://github.com/fedora-infra/fmn/pull/248
On 11/01/2017 06:18 PM, Jeremy Cline wrote:
> * The FMN front-end is currently still on RHEL7, but I haven't updated it in stage yet, so I don't know whether any adjustments are necessary for that (the front-end doesn't use Celery, so the fact that it's old _shouldn't_ be a problem).
For what it's worth, I did this today and it seems to be fine.
On 11/01/2017 03:18 PM, Jeremy Cline wrote:
> Hey folks,
> The latest version of FMN in production includes a patch[0] that breaks all the rules that query for package watchers, resulting in this[1] infrastructure issue. There's an open PR[2] on FMN that fixes the issue (reviews welcome). To get the fix into production, we have two options.
> The first is to backport it to the current version in production (1.5), which should be trivial since nothing in this area has been touched in 2.0. We can then update production and carry on.
> The second option is to update production to 2.0 now (I've included [2] as a patch in the RPM currently in stage). 2.0 includes a rewrite of the back-end components of FMN to use Celery. It's running in stage now. Things to note about this:
> * The FMN back-end now requires F26 because of its Celery version requirements.
> * The FMN front-end is currently still on RHEL7, but I haven't updated it in stage yet, so I don't know whether any adjustments are necessary for that (the front-end doesn't use Celery, so the fact that it's old _shouldn't_ be a problem).
> * Some care will need to be taken when switching over the AMQP queues, especially because the current FMN queues are jammed with unformattable messages that FMN keeps requeuing (about 25K of them). We could also just cut our losses and drop these.
> * The scripts that monitor queue length will need to be adjusted, since there are more queues now and the existing queues have been renamed.
> One thing to note is that we're going to have to go through all of the above at some point anyway. FMN also doesn't really have anything to do with the release process, so if it all goes south during the freeze it shouldn't matter.
> I don't have a preference one way or the other, really. Whatever makes the admins happy makes me happy.
I'm a bit torn on this one. It seems a bit of a rush to push into prod without having tested the front-ends and confirmed that fix for watchers, but on the other hand nothing around the release should block on this, and it would be nice to get prod onto a code base that we have more confidence in and more ability to fix going forward.
Do we have any way to tell what all those bad 25K messages are? Likely copr rubygems rebuild ones? If that's all they are, I am fine with dropping them and starting afresh.
Can you make a patch to fix the monitoring scripts and attach it, and also update the staging front-ends and confirm they're OK?
With those in hand, I think I'd be +1 to just upgrade and drop the old messages.
kevin
The worker queue has been renamed to "fmn.tasks.unprocessed_messages" both for clarity and because the message format has changed. The "backends" queue is no more and has been replaced by one queue per delivery medium. Right now that means there are "fmn.backends.irc" and "fmn.backends.email".
Signed-off-by: Jeremy Cline <jeremy@jcline.org>
---
 roles/nagios_client/templates/check_fmn.cfg.j2 | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/roles/nagios_client/templates/check_fmn.cfg.j2 b/roles/nagios_client/templates/check_fmn.cfg.j2
index 05111bdb2..8052eea8f 100644
--- a/roles/nagios_client/templates/check_fmn.cfg.j2
+++ b/roles/nagios_client/templates/check_fmn.cfg.j2
@@ -1,2 +1,3 @@
-command[check_fmn_worker_queue]={{ libdir }}/nagios/plugins/check_rabbitmq_size workers 200 1000
-command[check_fmn_backend_queue]={{ libdir }}/nagios/plugins/check_rabbitmq_size backends 100 200
+command[check_fmn_worker_queue]={{ libdir }}/nagios/plugins/check_rabbitmq_size fmn.tasks.unprocessed_messages 200 1000
+command[check_fmn_backend_irc_queue]={{ libdir }}/nagios/plugins/check_rabbitmq_size fmn.backends.irc 100 200
+command[check_fmn_backend_email_queue]={{ libdir }}/nagios/plugins/check_rabbitmq_size fmn.backends.email 100 200
On 11/03/2017 07:28 AM, Jeremy Cline wrote:
> The worker queue has been renamed to "fmn.tasks.unprocessed_messages" both for clarity and because the message format has changed. The "backends" queue is no more and has been replaced by one queue per delivery medium. Right now that means there are "fmn.backends.irc" and "fmn.backends.email".
> [diff snipped]
Looks good to me. Of course, it only needs to be applied if we push the new FMN in... +1
kevin
On 11/02/2017 04:42 PM, Kevin Fenzi wrote:
> I'm a bit torn on this one. It seems a bit of a rush to push into prod without having tested the front-ends and confirmed that fix for watchers, but on the other hand nothing around the release should block on this, and it would be nice to get prod onto a code base that we have more confidence in and more ability to fix going forward.
I updated the front-ends last week and things seem to be running smoothly. jgrulich also gave staging a test and says it fixed his issue.
> Do we have any way to tell what all those bad 25K messages are? Likely copr rubygems rebuild ones? If that's all they are, I am fine with dropping them and starting afresh.
Based on the tracebacks in the back-end logs, I'd say most of them are COPR messages. Obviously there will be a couple of other messages stuck somewhere in the queue, but I'm not sure how much effort we should put into "saving" those. After all, the more time we spend on that, the more notifications we lose due to that bug.
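If we want more than tracebacks to go on, we could sample the queue and tally topics. A hedged sketch, not something I've run against production (it assumes RabbitMQ with pika 1.x, fedmsg-style JSON bodies that carry a "topic" key, and the old "workers" queue name):

import collections
import json

import pika

# Hypothetical sketch: peek at up to 1000 stuck messages and count their
# topics. The host, the queue name, and the "topic" key in the body are
# assumptions about the deployment and message format.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

topics = collections.Counter()
for _ in range(1000):
    method, properties, body = channel.basic_get(queue="workers", auto_ack=False)
    if method is None:
        break  # queue is empty
    try:
        topics[json.loads(body).get("topic", "<no topic>")] += 1
    except (ValueError, AttributeError):
        topics["<unparseable>"] += 1

for topic, count in topics.most_common(10):
    print(count, topic)

# Nothing was acked, so closing the connection requeues the sampled messages.
connection.close()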
On 11/06/2017 07:36 AM, Jeremy Cline wrote:
> On 11/02/2017 04:42 PM, Kevin Fenzi wrote:
>> I'm a bit torn on this one. It seems a bit of a rush to push into prod without having tested the front-ends and confirmed that fix for watchers, but on the other hand nothing around the release should block on this, and it would be nice to get prod onto a code base that we have more confidence in and more ability to fix going forward.
> I updated the front-ends last week and things seem to be running smoothly. jgrulich also gave staging a test and says it fixed his issue.
Sweet. :)
>> Do we have any way to tell what all those bad 25K messages are? Likely copr rubygems rebuild ones? If that's all they are, I am fine with dropping them and starting afresh.
> Based on the tracebacks in the back-end logs, I'd say most of them are COPR messages. Obviously there will be a couple of other messages stuck somewhere in the queue, but I'm not sure how much effort we should put into "saving" those. After all, the more time we spend on that, the more notifications we lose due to that bug.
Indeed.
OK, I am +1 for just making a new F26-based notifs-backend01, installing fresh on it, and abandoning the old queue.
Can we get one other +1?
I'd be happy to help with this tomorrow, provided we get another +1.
kevin