I've been thinking about how we should handle nagios in the ansible world.
our current nagios config in puppet has a number of issues:
1. it's a bit cumbersome b/c you edit nagios independent of adding the host's config
2. when you remove a host the nagios config doesn't automatically go away
3. the fqdn/vpn hostname thing between noc01 and noc02 is kinda a giant pain in the ass
4. dependencies between Networks->vhosts->guests are manual and irritating to maintain.
I'm open to suggestions about how to maintain all of this. Here are some ideas I've tinkered with:
a. we stop putting nagios configs in the specific config mgmt entirely and put them in another repo - like with dns. That doesn't make 1 or 2 any better - but it could allow us to script things to make 3 and 4 much better
b. we make populating nagios configs for hosts or services be a function of playbooking the host/group creation. So all of the nagios configs go on when you add the host. That doesn't solve 2 or 4 but maybe it does handle 1 and 3 a bit.
c. we make the nagios configs generate from the host inventory data that ansible can retrieve. It will require us to define a series of additional variables per host or per group. So when you add a new host you'll need to wait for a cron run or an ansible run against our nagios hosts to get them to see the new hosts. With enough effort I think we can tackle all of 1, 2, 3 and 4 by creating THE whole set of nagios configs that way and rsyncing them over using the ansible-rsync module (or just rsync). The problem with this one is that it seems like an all-or-nothing scenario - we need to drive ALL of our nagios configs off of this or none at all. With that in mind it seems like we would need to define hosts as part of ansible even if they are still being managed by puppet. That's extra work but I think it is work we'd have to do eventually.
So (c) would be something like this:
- take the list of hosts - look for a vmhost or if it is a cloud instance - make that a dep
- look for a datacenter - make that a dep
- look for a vpn cert - make that a dep
- and on up the chain.
- look for any special service definitions that we'd be managing manually
- put all of the hosts definitions in one big file so changing out that file can be idempotent
- put service definitions in individual files - but have the files rsynced over with --delete so removing one gets removed on the nagios side, too
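The dependency walk in (c) could be sketched roughly as below. This is only an illustration, not an existing script: the per-host variable names (`address`, `vmhost`, `datacenter`, `vpn`) and the `generic-host` template are assumptions, and the inventory is a plain dict standing in for whatever ansible would hand back. Nagios treats a host with several parents as unreachable only when all of them are down, so whether the vpn belongs in `parents` at all is left open here.

```python
from typing import Dict, List

# Hypothetical per-host variables; "address", "vmhost", "datacenter" and
# "vpn" are assumed names, not variables that exist in the inventory today.
HostVars = Dict[str, str]


def parents_for(hostvars: HostVars) -> List[str]:
    """Pick the nearest thing this host depends on, then add the vpn if any."""
    parents: List[str] = []
    if hostvars.get("vmhost"):
        parents.append(hostvars["vmhost"])
    elif hostvars.get("datacenter"):
        parents.append(hostvars["datacenter"])
    if hostvars.get("vpn"):
        parents.append(hostvars["vpn"])
    return parents


def render_host(name: str, hostvars: HostVars) -> str:
    """Render one nagios host definition."""
    lines = [
        "define host {",
        "    use         generic-host",  # assumed template name
        f"    host_name   {name}",
        f"    address     {hostvars.get('address', name)}",
    ]
    parents = parents_for(hostvars)
    if parents:
        lines.append(f"    parents     {','.join(parents)}")
    lines.append("}")
    return "\n".join(lines)


def render_all(inventory: Dict[str, HostVars]) -> str:
    """Render every host into one file body, sorted so output is stable."""
    return "\n\n".join(render_host(n, inventory[n]) for n in sorted(inventory)) + "\n"
```

Sorting the hosts means the same inventory always yields the same file, so the "one big file" only changes when the inventory does.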
Anyone have an option D we should think about? I'd like to hear about more
Thanks, -sv
----- Original Message -----
From: "seth vidal" skvidal@fedoraproject.org
To: "infrastructure" infrastructure@lists.fedoraproject.org
Sent: Monday, June 17, 2013 1:35:21 PM
Subject: nagios and ansible
I've been thinking about how we should handle nagios in the ansible world.
our current nagios config in puppet has a number of issues:
1. it's a bit cumbersome b/c you edit nagios independent of adding the host's config
2. when you remove a host the nagios config doesn't automatically go away
3. the fqdn/vpn hostname thing between noc01 and noc02 is kinda a giant pain in the ass
4. dependencies between Networks->vhosts->guests are manual and irritating to maintain.
I'm open to suggestions about how to maintain all of this. Here are some ideas I've tinkered with:
a. we stop putting nagios configs in the specific config mgmt entirely and put them in another repo - like with dns. That doesn't make 1 or 2 any better - but it could allow us to script things to make 3 and 4 much better
b. we make populating nagios configs for hosts or services be a function of playbooking the host/group creation. So all of the nagios configs go on when you add the host. That doesn't solve 2 or 4 but maybe it does handle 1 and 3 a bit.
c. we make the nagios configs generate from the host inventory data that ansible can retrieve. It will require us to define a series of additional variables per host or per group. So when you add a new host you'll need to wait for a cron run or an ansible run against our nagios hosts to get them to see the new hosts. With enough effort I think we can tackle all of 1, 2, 3 and 4 by creating THE whole set of nagios configs that way and rsyncing them over using the ansible-rsync module (or just rsync). The problem with this one is that it seems like an all-or-nothing scenario - we need to drive ALL of our nagios configs off of this or none at all. With that in mind it seems like we would need to define hosts as part of ansible even if they are still being managed by puppet. That's extra work but I think it is work we'd have to do eventually.
So (c) would be something like this:
- take the list of hosts - look for a vmhost or if it is a cloud instance - make that a dep
- look for a datacenter - make that a dep
- look for a vpn cert - make that a dep
- and on up the chain.
- look for any special service definitions that we'd be managing manually
- put all of the hosts definitions in one big file so changing out that file can be idempotent
- put service definitions in individual files - but have the files rsynced over with --delete so removing one gets removed on the nagios side, too
Anyone have an option D we should think about? I'd like to hear about more
So I hate to say "what about not any of this as option D" but... have you looked at sensu at all? http://sensuapp.org
While I cannot answer half the questions you have above :) - I can somewhat confidently relate the following items:
1) Designed as ... nagios... for the cloud... but doesn't suck
2) Automagically detects new hosts
3) Plugs into rabbitmq (and thus hopefully fedbus - not sure what flavor of amqp we are using there?)
4) Can re-use existing nagios plugins
5) Events can be passed to handlers - either stuff like making pretty pictures (graphite), pagerduty, etc. or triggering something else to happen (scripts, etc)
6) Handles things like roles, users, etc; clients have (multiple) subscriptions, and will do checks based on what they subscribe to ("production" or "web" or "mailserver" or whatever you dream up)
7) Works with Puppet, Chef; I am not sure on ansible but there seems to be some random discussion on that capability when googling (admittedly I didn't look past "list of links")
I saw Joe Miller from Pantheon give a presentation on sensu a few weeks ago, slide link follows (2nd link) - they are using it, with Fedora, perhaps a reachout might be enlightening....
https://speakerdeck.com/joemiller/introduction-to-sensu
https://speakerdeck.com/joemiller/practical-examples-with-sensu-monitoring-f...
Also: http://docs.sensuapp.org/0.9/overview.html
And, yes, ruby, meh. But checks and handlers can be written in any language. No clue on how well it would handle all the network things you list but it does seem fairly flexible.
Anyway: People seem to be pretty happy with it for cloudy things... since it was designed with that in mind. Might be worth at least playing with or reading about?
-r
Thanks, -sv
infrastructure mailing list infrastructure@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/infrastructure
On Mon, 17 Jun 2013 16:35:21 -0400 seth vidal skvidal@fedoraproject.org wrote:
I've been thinking about how we should handle nagios in the ansible world.
our current nagios config in puppet has a number of issues:
1. it's a bit cumbersome b/c you edit nagios independent of adding the host's config
2. when you remove a host the nagios config doesn't automatically go away
3. the fqdn/vpn hostname thing between noc01 and noc02 is kinda a giant pain in the ass
I think this actually should be fixed now? Or am I misremembering fixing it? (possible)
4. dependencies between Networks->vhosts->guests are manual and irritating to maintain.
Agreed on the rest for sure.
I'm open to suggestions about how to maintain all of this. Here are some ideas I've tinkered with:
a. we stop putting nagios configs in the specific config mgmt entirely and put them in another repo - like with dns. That doesn't make 1 or 2 any better - but it could allow us to script things to make 3 and 4 much better
b. we make populating nagios configs for hosts or services be a function of playbooking the host/group creation. So all of the nagios configs go on when you add the host. That doesn't solve 2 or 4 but maybe it does handle 1 and 3 a bit.
I like this idea, I think it does solve 4 a tad... as you would know what virthost and datacenter, etc from the ansible vars.
c. we make the nagios configs generate from the host inventory data that ansible can retrieve. It will require us to define a series of additional variables per host or per group. So when you add a new host you'll need to wait for a cron run or an ansible run against our nagios hosts to get them to see the new hosts. With enough effort I think we can tackle all of 1, 2, 3 and 4 by creating THE whole set of nagios configs that way and rsyncing them over using the ansible-rsync module (or just rsync). The problem with this one is that it seems like an all-or-nothing scenario - we need to drive ALL of our nagios configs off of this or none at all. With that in mind it seems like we would need to define hosts as part of ansible even if they are still being managed by puppet. That's extra work but I think it is work we'd have to do eventually.
Yeah, we could also fire up a new noc03/04 to test with... leave the existing stuff around and test with this until the new one was 'good enough'.
So (c) would be something like this:
- take the list of hosts - look for a vmhost or if it is a cloud instance - make that a dep
What about non-linux stuff? Management interfaces and tape drives and such? I guess we need some extra task sugar that adds those in? We would need some kind of inventory for them that's not real inventory (since ansible wouldn't work against them).
- look for a datacenter - make that a dep
- look for a vpn cert - make that a dep
- and on up the chain.
- look for any special service definitions that we'd be managing manually
I guess we could just define these as a var per host/group that has the special checks?
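A var per host/group holding the special checks might look something like the sketch below. The variable name `nagios_checks` and the `description`/`command` keys are made up for illustration; the only idea taken from the thread is that group-level checks apply to every member and a host can add its own. Letting a host entry override a group entry with the same description is an extra assumption.

```python
from typing import Dict, List

Check = Dict[str, str]  # hypothetical shape: {"description": ..., "command": ...}
Vars = Dict[str, Dict[str, List[Check]]]


def checks_for(host: str, groups: List[str], group_vars: Vars, host_vars: Vars) -> List[Check]:
    """Collect group checks first, then host checks; same description overrides."""
    merged: Dict[str, Check] = {}
    for group in groups:
        for check in group_vars.get(group, {}).get("nagios_checks", []):
            merged[check["description"]] = check
    for check in host_vars.get(host, {}).get("nagios_checks", []):
        merged[check["description"]] = check
    return list(merged.values())


def render_service(host: str, check: Check) -> str:
    """Render one nagios service definition for a host."""
    return "\n".join([
        "define service {",
        "    use                   generic-service",  # assumed template name
        f"    host_name             {host}",
        f"    service_description   {check['description']}",
        f"    check_command         {check['command']}",
        "}",
    ])
```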
- put all of the hosts definitions in one big file so changing out that file can be idempotent
- put service definitions in individual files - but have the files rsynced over with --delete so removing one gets removed on the nagios side, too
Yep. Seems reasonable.
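For the per-file service definitions, the effect of rsyncing with --delete can be imitated locally as a rough sketch. This is not rsync itself, just the same "anything not generated this run goes away" behaviour; the function name and the `*.cfg` naming are assumptions.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def sync_service_files(target_dir: str, wanted: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Write wanted files (name -> content) and delete stale *.cfg files.

    Returns (written, removed) so a caller can tell whether anything changed.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[str] = []
    removed: List[str] = []
    for name, content in sorted(wanted.items()):
        path = target / name
        # Only touch files whose content differs, so reruns are no-ops.
        if not path.exists() or path.read_text() != content:
            path.write_text(content)
            written.append(name)
    for path in sorted(target.glob("*.cfg")):
        if path.name not in wanted:
            path.unlink()
            removed.append(path.name)
    return written, removed
```

If both returned lists are empty, nothing changed and nagios would not need a reload.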
Also, how about pushing the new config (when we do) and running 'nagios -v /etc/nagios/nagios.cfg' and if it fails, punt back and restore the old data and error. (we would have to save the old content, but that shouldn't be too hard).
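The push, verify, and restore idea could be sketched like this. The `nagios -v /etc/nagios/nagios.cfg` command is the one named above; everything else (the function name, copying whole directories, keeping the backup in a temp dir) is an assumption about how it might be wired up, not a tested deployment script.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Sequence

DEFAULT_VALIDATE = ("nagios", "-v", "/etc/nagios/nagios.cfg")


def deploy_with_rollback(new_dir, live_dir, validate_cmd: Sequence[str] = DEFAULT_VALIDATE) -> bool:
    """Swap in the new config, run the validator, restore the old one on failure."""
    new, live = Path(new_dir), Path(live_dir)
    backup_root = Path(tempfile.mkdtemp(prefix="nagios-backup-"))
    backup = backup_root / "saved"

    shutil.copytree(live, backup)  # save the old content first
    shutil.rmtree(live)
    shutil.copytree(new, live)

    try:
        ok = subprocess.run(list(validate_cmd)).returncode == 0
    except OSError:
        ok = False  # validator missing or not executable counts as a failure

    if not ok:
        # Punt back: put the saved config in place again.
        shutil.rmtree(live)
        shutil.copytree(backup, live)
    shutil.rmtree(backup_root, ignore_errors=True)
    return ok
```

A caller would treat a `False` return as the error case and skip reloading nagios.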
I am liking (c) more and more. ;)
Anyone have an option D we should think about? I'd like to hear about more
Well, we could always look around at !nagios again, but it's always depressing. ;) The one suggested downthread we could look at, but ruby will be a hard sell here.
I think any that wouldn't be configurable via ansible are a no go really, since we don't want to have to use some web interface or other manual junk to add hosts.
kevin