I'd like to apply the following which does:
- Adds a script I wrote for reading a timestamp from a file on disk and alerting if the timestamp within it is NOT within a particular delta to now.
- Applies this to sundries01 and uses it to check /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
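For reference, here is a minimal sketch of how a build step could produce such a timestamp file. It is an illustration under assumptions, not the actual websites build code: the helper name `write_build_timestamp` is hypothetical, it targets Python 3, and writing through a temporary file plus rename is just one way to keep the check from reading a half-written file.

```python
import os
import tempfile
import time


def write_build_timestamp(path, now=None):
    """Write an epoch timestamp to `path`, replacing the file atomically."""
    if now is None:
        now = time.time()
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it over the
    # target, so a concurrent reader never sees a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('%.2f\n' % now)
        # mkstemp creates the file as 0600; make it readable for the check.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
    return now
```

The file then contains a single float such as the check script expects, and nothing else.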
The purpose: sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in OpenShift (cronjob), we don't get any kind of alert when that happens.
Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
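To make the threshold concrete: assuming the build cron runs hourly (which "in theory it should be 1" suggests, but it is an assumption here), a 3 hour delta lets roughly three consecutive scheduled builds fail before the check goes critical. A small sketch of that arithmetic, with hypothetical names:

```python
def missed_builds_before_alert(delta_seconds, build_interval_seconds):
    """How many scheduled builds can be missed before the age exceeds delta."""
    return int(delta_seconds // build_interval_seconds)


DELTA = 3 * 60 * 60        # 10800 seconds, the value passed to the check
BUILD_INTERVAL = 60 * 60   # assumed hourly cron schedule

print(missed_builds_before_alert(DELTA, BUILD_INTERVAL))
```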
Rick
commit 657d050f6d699bc43973d968cd93d12131fca7f2
Author: Rick Elrod <relrod@redhat.com>
Date:   Thu Feb 27 05:29:24 2020 +0000

    nagios: Add script and check for checking that a timestamp within a file is within a delta of now, and then use this for alerting when websites stop building

    Signed-off-by: Rick Elrod <relrod@redhat.com>

diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file
new file mode 100644
index 0000000..9064337
--- /dev/null
+++ b/roles/nagios_client/files/scripts/check_timestamp_from_file
@@ -0,0 +1,43 @@
+#!/usr/bin/env python
+
+# Takes a path to a file and a delta. The file must simply contain an epoch
+# timestamp. It can be an integer or a float, as can the delta.
+#
+# Alerts critical if (now - timestamp contained in file) > delta.
+#
+# Rick Elrod <relrod@redhat.com>
+# MIT
+
+import sys
+import time
+
+if len(sys.argv) != 3:
+    print('UNKNOWN: Pass path to file and delta as parameters')
+    sys.exit(3)
+
+filename = sys.argv[1]
+delta = float(sys.argv[2])
+
+timestamp = None
+
+try:
+    with open(filename, 'r') as f:
+        timestamp = float(f.read().strip())
+except Exception as e:
+    print('UNKNOWN: Unable to open/read file path')
+    sys.exit(3)
+
+difference = round(time.time() - timestamp, 2)
+if difference > delta:
+    print(
+        'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % (
+            timestamp,
+            delta,
+            difference - delta))
+    sys.exit(2)
+
+print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % (
+    timestamp,
+    delta,
+    abs(difference - delta)))
+sys.exit(0)
diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml
index 2e5e0df..8e71a3b 100644
--- a/roles/nagios_client/tasks/main.yml
+++ b/roles/nagios_client/tasks/main.yml
@@ -47,6 +47,7 @@
     - check_osbs_api.py
     - check_ipa_replication
     - check_redis_queue.sh
+    - check_timestamp_from_file
   when: not inventory_hostname.startswith('noc')
   tags:
   - nagios_client
@@ -226,6 +227,16 @@
   tags:
   - nagios_client
 
+- name: install nrpe checks for sundries/websites
+  template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644
+  with_items:
+  - check_websites_buildtime.cfg
+  when: inventory_hostname.startswith('sundries')
+  notify:
+  - restart nrpe
+  tags:
+  - nagios_client
+
 - name: install nrpe config for the RabbitMQ checks
   template:
     src: "rabbitmq_args.ini.j2"
diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
new file mode 100644
index 0000000..ff5639d
--- /dev/null
+++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
@@ -0,0 +1,2 @@
+# Alert if websites haven't been built in 3 hours
+command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800
diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
index 85e8f8e..c8958d7 100644
--- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2
+++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
@@ -316,4 +316,14 @@ define service {
   use ppc-secondarytemplate
 }
 
+## Auxillary to websites but necessary to make them happen
+
+define service {
+  host_name sundries01.phx2.fedoraproject.org
+  service_description websites build happened recently
+  check_command check_by_nrpe!check_websites_buildtime
+  use websitetemplate
+}
+
+
 {% endif %}
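The core comparison is easy to exercise without touching the filesystem or the clock. The following is a sketch and not part of the patch: `evaluate_timestamp` is a hypothetical helper that takes `now` as a parameter and returns the Nagios exit code and message instead of printing and exiting.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_timestamp(timestamp, delta, now=None):
    """Return (exit_code, message) for a timestamp checked against delta."""
    if now is None:
        now = time.time()
    difference = round(now - timestamp, 2)
    if difference > delta:
        return CRITICAL, (
            'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) '
            'by %.2f seconds' % (timestamp, delta, difference - delta))
    return OK, (
        'OK: Timestamp in file (%.2f) is within delta (%.2f) of now, '
        'by %.2f seconds' % (timestamp, delta, abs(difference - delta)))


# Usage: a build that finished 4 hours ago, checked with a 3 hour delta.
code, message = evaluate_timestamp(
    timestamp=1000.0, delta=10800.0, now=1000.0 + 4 * 3600)
print(code, message)
```

Passing `now` explicitly keeps the behaviour deterministic, which is handy when trying out different delta values.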
On Thu, 27 Feb 2020 at 06:53, Rick Elrod codeblock@elrod.me wrote:
> I'd like to apply the following which does:
> - Adds a script I wrote for reading a timestamp from a file on disk
>   and alerting if the timestamp within it is NOT within a particular delta to now.
> - Applies this to sundries01 and uses it to check
>   /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
> The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens.
I think it would be better to find a way to monitor the cronjob in OpenShift since that will be useful for other projects. Did you investigate that idea ?
> Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
+1 but I would prefer a way to have notification on a failed cronjob :-)
> Rick
_______________________________________________
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedorapro...
On Thu, Feb 27, 2020 at 4:31 AM Clement Verna cverna@fedoraproject.org wrote:
> On Thu, 27 Feb 2020 at 06:53, Rick Elrod codeblock@elrod.me wrote:
>> I'd like to apply the following which does:
>> - Adds a script I wrote for reading a timestamp from a file on disk
>>   and alerting if the timestamp within it is NOT within a particular delta to now.
>> - Applies this to sundries01 and uses it to check
>>   /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
>> The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens.
> I think it would be better to find a way to monitor the cronjob in OpenShift since that will be useful for other projects. Did you investigate that idea ?
>> Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
> +1 but I would prefer a way to have notification on a failed cronjob :-)
I'd prefer that too (or probably in addition), but I don't know anything about how to set up that monitoring right now. It looks like there's an OpenShift API endpoint for monitoring crons: https://major.io/2019/11/18/monitoring-openshift-cron-jobs/ but we'd need to set up an API key for nagios checks to use somehow. Probably worth looking into, but for the time being I'd still like to apply this FBR, as we are going to have some Outreachy activity happening on websites soon and we need to know that the prod build isn't broken.
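As a rough idea of what such a check might look like, here is a sketch written under assumptions. It assumes the cron's runs show up as Kubernetes `Job` objects (for example the `items` of a jobs list returned by the batch API) carrying the standard `status.startTime`, `status.succeeded` and `status.failed` fields. The function name, messages and sample data are made up, and fetching the JSON with a token is left out.

```python
def evaluate_latest_job(jobs):
    """Map the most recently started Job to a Nagios (exit_code, message)."""
    started = [j for j in jobs if j.get('status', {}).get('startTime')]
    if not started:
        return 3, 'UNKNOWN: no started jobs found'
    # startTime is an RFC 3339 UTC string such as 2020-02-27T05:00:00Z, so
    # plain string comparison orders the jobs chronologically.
    latest = max(started, key=lambda j: j['status']['startTime'])
    name = latest.get('metadata', {}).get('name', '?')
    status = latest['status']
    if status.get('succeeded', 0) > 0:
        return 0, 'OK: latest job %s succeeded' % name
    if status.get('failed', 0) > 0:
        return 2, 'CRITICAL: latest job %s failed' % name
    return 0, 'OK: latest job %s still running' % name


# Made-up sample data for illustration only.
sample = [
    {'metadata': {'name': 'build-1'},
     'status': {'startTime': '2020-02-27T04:00:00Z', 'succeeded': 1}},
    {'metadata': {'name': 'build-2'},
     'status': {'startTime': '2020-02-27T05:00:00Z', 'failed': 1}},
]
print(evaluate_latest_job(sample))
```

Unlike the timestamp check, this would alert on the first failed run rather than after a delta has elapsed, so the two could complement each other.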
-re
On Thu, 27 Feb 2020 at 12:03, Rick Elrod codeblock@elrod.me wrote:
> On Thu, Feb 27, 2020 at 4:31 AM Clement Verna cverna@fedoraproject.org wrote:
>> On Thu, 27 Feb 2020 at 06:53, Rick Elrod codeblock@elrod.me wrote:
>>> I'd like to apply the following which does:
>>> - Adds a script I wrote for reading a timestamp from a file on disk
>>>   and alerting if the timestamp within it is NOT within a particular delta to now.
>>> - Applies this to sundries01 and uses it to check
>>>   /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
>>> The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens.
>> I think it would be better to find a way to monitor the cronjob in OpenShift since that will be useful for other projects. Did you investigate that idea ?
>>> Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
>> +1 but I would prefer a way to have notification on a failed cronjob :-)
> I'd prefer that too (or probably in addition), but I don't know anything about how to set up that monitoring right now. It looks like there's an OpenShift API endpoint for monitoring crons: https://major.io/2019/11/18/monitoring-openshift-cron-jobs/ but we'd need to set up an API key for nagios checks to use somehow.
Yes I think we would need to have a "nagios" service account, then that should give us a token to use for authentication.
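A sketch of how a check could use such a token, assuming it is stored in a file readable by the nagios/nrpe user and sent as a standard `Authorization: Bearer` header. The host name, namespace and token below are placeholders, not real values from our setup, and the helper names are hypothetical.

```python
import urllib.request


def read_token(path):
    """Read a service account token from a file (the path is site-specific)."""
    with open(path, 'r') as f:
        return f.read().strip()


def build_api_request(url, token):
    """Build a GET request that authenticates with a bearer token."""
    return urllib.request.Request(
        url,
        headers={
            'Authorization': 'Bearer %s' % token.strip(),
            'Accept': 'application/json',
        })


# Placeholder values for illustration only.
request = build_api_request(
    'https://openshift.example.org/apis/batch/v1/namespaces/websites/jobs',
    'example-token\n')
print(request.get_header('Authorization'))
```

Passing the request to `urllib.request.urlopen` would perform the call; that part is omitted because it depends on the cluster's certificates and network access from the nagios host.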
> Probably worth looking into, but for the time being I'd still like to apply this FBR, as we are going to have some Outreachy activity happening on websites soon and we need to know that the prod build isn't broken.
> -re
On Thu, 27 Feb 2020 at 00:47, Rick Elrod codeblock@elrod.me wrote:
> I'd like to apply the following which does:
> - Adds a script I wrote for reading a timestamp from a file on disk
>   and alerting if the timestamp within it is NOT within a particular delta to now.
> - Applies this to sundries01 and uses it to check
>   /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
> The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens.
> Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
> Rick
Patch has been reviewed and looks correct for nagios and nrpe.
On Wed, Feb 26, 2020 at 11:46:55PM -0600, Rick Elrod wrote:
> I'd like to apply the following which does:
> - Adds a script I wrote for reading a timestamp from a file on disk
>   and alerting if the timestamp within it is NOT within a particular delta to now.
> - Applies this to sundries01 and uses it to check
>   /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build.
> The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens.
> Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting.
+1
I agree we need larger fixes/monitoring here, but this is good for now..
kevin