Hi!
It seems that setting any dampening on the alert templates has no effect; I get every alert regardless.
I usually use "one for several hours"... practically I only use templates, so the servers inherit the settings.
Is there any trick I'm missing? I have only been using 4.8 for a few days, but as far as I can remember, it worked as I expected.
Attila
Has no one encountered anything similar? I have verified that no matter what I set in the dampening section, I still get each and every alert mail! Very annoying, to be honest...
What should I check to confirm I have set it up correctly (I don't think there's anything I could miss, it's quite simple, and I have also seen it work)? Or is there any log I should correlate?
Attila
Hi Attila, two things. First, what type of alert conditions are you using? Certain conditions, most notably Availability conditions like GOES DOWN, ignore dampening because they are discrete events. Second, there was a recent bug fix in the area of dampening. The hourly data purge was negating some of the dampening state, so dampening that crossed the top of the hour may have been affected. This fix will be in 4.10.
Do either of these things perhaps explain what you are seeing?
Also, you are saying that this is dampening at the alert template level? Check to ensure the dampening is also reflected at the resource level by looking at the alert definition that was propagated down from the template to one of the resources of the relevant type.
Hi!
Thanks a lot, it helps me understand things! My problem may be rooted in the concept of "discrete events". Anyway, is it possible that the alert "process CPU consumption (percent) is higher than a given value" is also treated as a discrete event? I receive an alert every 20 minutes (or whatever the measurement interval is).
The template settings are propagated down to the individual resources correctly!
Regards,
Attila
No, all of the metric conditions should use dampening just fine. Availability is a bit different because we only signal changes in availability rather than supplying a constant stream of availability data (this is for efficiency). But metrics are reported as scheduled, whether the value changes or not.
Does this happen only when the alert definition is derived from a template? If you define the exact same alert def at the resource level, does the dampening work? That would be very strange, but I'm just checking.
Can you post the exact alert def conditions and the dampening rule you have in effect?
It happens with all alerts; it does not matter whether the alert is derived from a template or defined directly on a resource.
For example, I had to define an alert on the number of open files for the postgres servers. It is defined as [Metric value threshold (Open file descriptors) > 10000], with dampening set to "Time period: 1 occurrence in 1 hour".
I receive an alert every 10 minutes. I have also tried the other dampening methods, but none worked for me.
Attila
Attila,
OK, looking at your dampening definition, this is actually a misunderstanding of dampening; the behavior is as expected (even if not as desired). "N occurrences in X hours" means that if the conditions are met N times in X hours, an alert is generated. If N = 1 the dampening is effectively useless: the conditions only have to match once in X hours, so as soon as they match you get the alert. After the alert fires, the whole thing *resets*. The intent of that dampening rule is to avoid alerting on spikes in activity and instead alert only when a certain aberration is consistent.
For example, if "number of open files" sometimes spikes, you may not want an alert. But if you are taking measurements every 10 minutes and you see the same spike 3 times in an hour, maybe there is a problem. In that case you'd say "3 occurrences in 1 hour".
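To make those semantics concrete, here is a minimal sketch of how an "N occurrences in a time window" rule behaves. This is an illustration only, not RHQ's actual code; the class and method names are invented.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative sketch of "N occurrences in a time window" dampening.
    class OccurrenceWindowDampening {
        private final int requiredOccurrences;             // N
        private final long windowMillis;                    // X hours, in milliseconds
        private final Deque<Long> matchTimes = new ArrayDeque<>();

        OccurrenceWindowDampening(int requiredOccurrences, long windowMillis) {
            this.requiredOccurrences = requiredOccurrences;
            this.windowMillis = windowMillis;
        }

        // Called each time the alert conditions match; returns true if the alert fires.
        boolean conditionMatched(long nowMillis) {
            matchTimes.addLast(nowMillis);
            // Drop matches that have fallen out of the sliding window.
            while (!matchTimes.isEmpty() && nowMillis - matchTimes.peekFirst() > windowMillis) {
                matchTimes.removeFirst();
            }
            if (matchTimes.size() >= requiredOccurrences) {
                matchTimes.clear();   // the count resets after the alert fires
                return true;
            }
            return false;
        }
    }

With N = 1 every single match reaches the threshold immediately, so every evaluation produces an alert, which is exactly the behavior you are describing.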
What you want is "fire no more than once per hour". We don't really have that sort of dampening, although it may be a nice RFE for a time-based recovery.
We have various dampening options, we have the ability to disable-after-fire, and we have recovery alerts. Perhaps you will find something that works for you; otherwise feel free to create an RFE BZ with your exact requirement.
For more on dampening and recovery alerting see https://docs.jboss.org/author/display/RHQ/Alerts#Alerts-Dampening.
Thanks, really!
It did help me understand the way it works. Unfortunately I don't entirely understand the idea behind it: it delays the alerts significantly, and here we usually monitor parameters which do not really recover by themselves.
We usually want:
- monitor frequently
- alert ASAP
- do not repeat the same alert for every trigger event
Even the simplest syslog apps can do this. Going the other way could also be interesting for many types of services... but in many cases there is just no chance that things get better without action, and it's strange that the only choices are getting the same alert mail every 10/20 minutes or delaying the alert.
Thanks again!
Regards,
Attila
Anyway... I can use the "Disable when fired" option, as you also mentioned.
Attila
You should look into recovery alerting. When you set disable-on-fire=true, you will not get any more of that alert until it is re-enabled, which means manual intervention to re-enable the alert definition after the issue is resolved. Recovery alerts can automate this: you define conditions (typically the opposite of what fired the original alert) that, when met, re-enable the original alert definition.
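In pseudocode terms, the pairing works roughly like this; again just a sketch to show the mechanics, not RHQ code, and the names are invented:

    // Illustrative sketch of disable-on-fire plus a recovery alert.
    class AlertPair {
        private boolean primaryEnabled = true;

        // Primary definition, e.g. open file descriptors > 10000, disable when fired.
        void onPrimaryConditionsMet() {
            if (primaryEnabled) {
                sendNotification("open file descriptors > 10000");
                primaryEnabled = false;   // disable-on-fire: no repeats until recovery
            }
        }

        // Recovery definition, e.g. open file descriptors back under the threshold.
        void onRecoveryConditionsMet() {
            if (!primaryEnabled) {
                primaryEnabled = true;    // the recovery alert re-enables the primary
            }
        }

        private void sendNotification(String reason) {
            System.out.println("ALERT: " + reason);
        }
    }

So you get one notification per excursion above the threshold, and the definition arms itself again automatically once the metric recovers.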
OK!
I hope it works as I expect :)
I define, for example:
1.
- alert if (open files) > 10000
- this must hold for 2 consecutive measurements (optional)
- disable when fired
2.
- alert if (open files) < 10000
- this must hold for 2 consecutive measurements (optional)
- recovery alert for: the 1st one
In this case, will the 2nd one re-enable the first one?
Attila
Has there ever been discussion of building the recovery concept into a dampening definition? Maybe a "transition" dampening which fires only when the state changes, in either direction?
Beyond Attila's overlooking the concept, which I think is probably common, it's a usability issue for us that does not scale well. For almost all of our alerts we'd prefer to be alerted only once on entry and once on exit. With the current arrangement, achieving that basic behavior can mean 4 separate definitions for essentially 1 condition (correct me if I'm wrong):
1 - Email alert on Availability Down, disable when fired
2 - Recovery alert on Availability Up, enabling #1
3 - Email alert on Availability Up, disable when fired
4 - Recovery alert on Availability Down, enabling #3
We basically just don't do this because it's not feasible to define 4x the alerts. But we'd love to have that functionality.
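For what it's worth, the proposed "transition" dampening could be as simple as the following sketch; nothing like this exists in RHQ today, and the names are invented purely to illustrate the idea:

    // Hypothetical "transition" dampening: fire only when the condition
    // evaluation flips between matching and not matching.
    class TransitionDampening {
        private Boolean lastResult = null;   // null until the first evaluation

        boolean evaluate(boolean conditionsMet) {
            // Fire on the first matching evaluation and on every subsequent flip.
            boolean fire = (lastResult == null)
                    ? conditionsMet
                    : (conditionsMet != lastResult.booleanValue());
            lastResult = conditionsMet;
            return fire;
        }
    }

One definition with a rule like this would cover both the entry and exit notifications that currently take the four definitions above.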
Thanks for all the hard work,
Matt
Matt, I'm fairly sure you only need the two alert definitions:
1 - Email alert on Availability Down, disable when fired
2 - Recovery alert on Availability Up, enabling #1
Only one of the two alert definitions will be active at a given time. #2 could also send e-mail, as recovery alert definitions can still perform all notifications, I believe.
If you just want to be alerted on any Goes DOWN or Goes UP, you could ignore recovery alerting completely and just define two standard alert defs that don't disable. Given the way availability works, you only get alerted on an availability *change* (a state change) anyway, so dampening is actually not relevant. But we also have the Availability Duration conditions, which don't match unless, for example, the resource Goes DOWN and Stays DOWN for X minutes.
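As a rough illustration of what an Availability Duration condition amounts to (a sketch under assumed semantics, not RHQ's implementation; the names are invented):

    // Illustrative sketch of "Goes DOWN and Stays DOWN for X minutes".
    class StaysDownCondition {
        private final long requiredDownMillis;
        private Long wentDownAt = null;   // null while the resource is UP
        private boolean fired = false;

        StaysDownCondition(long requiredDownMillis) {
            this.requiredDownMillis = requiredDownMillis;
        }

        // Feed availability reports; returns true once DOWN has persisted long enough.
        boolean report(boolean up, long nowMillis) {
            if (up) {
                wentDownAt = null;        // recovery resets the clock
                fired = false;
                return false;
            }
            if (wentDownAt == null) {
                wentDownAt = nowMillis;   // first DOWN report starts the clock
            }
            if (!fired && nowMillis - wentDownAt >= requiredDownMillis) {
                fired = true;             // fire once per continuous DOWN period
                return true;
            }
            return false;
        }
    }

That filters out brief blips while still alerting only once per outage.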
Having said all that, thanks for the suggestion, we're looking for ways to improve alerting and we know that dampening and recovery are two areas we can likely do better.
On 12/4/2013 9:05 AM, Matt Warren wrote:
Has there ever been discussion of building the recovery concept into a dampening definition? Maybe "transition" dampening which fires just when state changes? In either direction.
Beyond Attila's overlooking the concept, which I think is probably common, it's a usability issue for us that does not scale well. Most all of our alerts we'd prefer only alert us once for entry and once for exit. With the current arrangement, getting alerts for that basic arrangement can mean 4 separate definitions for basically 1 condition. (correct if I'm wrong)
1 - Email alert on Availability Down, disable when fired 2 - Recovery alert on Availability Up, enabling #1 1 - Email alert on Availability Up, disable when fired 2 - Recovery alert on Availability Down, enabling #2
We basically just don't do this because it's not feasible to define 4x the alerts. But we'd love to have that functionality.
Thanks for all the hard work,
Matt
OK!
I hope it works as I expect :)
I define for example
- alert if (open files) > 10000
- this stands for consecutive 2 measurements (optional)
- disabe when fired
- alert if (open files) < 10000
- this stands for consecutive 2 measurements (optional)
- Recover alert: 1st one
In this case the 2nd will re-enable the first one?
Attila
2013/12/3 Jay Shaughnessy <jshaughn@redhat.commailto:jshaughn@redhat.com>
You should look into recovery alerting. When you set disable-on-fire=true it means you will not get any more of that alert until it is re-enabled. That means manual intervention to reset the alert definitions to enabled after the issue is resolved. Recovery alerts can automate this by defining conditions (typically the opposite of what fired the original alert) that when met will re-enable the original alert definition.
On 12/3/2013 4:05 AM, Attila Heidrich wrote: Anyway... I can use the "Disable when fired" option - as you mentioned it also.
Attila
2013/12/3 Attila Heidrich <attila.heidrich@gmail.commailto:attila.heidrich@gmail.com> Thanks really!
It did help me to understand the way it works. Unfortunately I do not perfectly understand the idea behind. It delays the alerts significantly, and we here usually monitor parameters which does not really recover by-itself.
We usually want:
- monitor frequently
- alert ASAP
- do not repeat the same alert for all the trigger event
Even the simplest syslog apps can do this. Doing the other way round could also be interesting for many types of services ... but in many cases there is just no chance that things get better without action - but it's just strange to get the same alert mail in each and every 10/20 minutes - or delay the alert - and no other way.
Thanks again!
Regards,
Attila
2013/12/2 Jay Shaughnessy <jshaughn@redhat.commailto:jshaughn@redhat.com>
Attila,
OK, looking at your dampening definition this is actually a misunderstanding of dampening, the behavior is as expected (even if not as desired). When you say "N Occurrence in X Hours" that means that if the conditions are met N times in X hours that you will generate an alert. If N = 1 the dampening is effectively useless. That just means the conditions have to match 1 time in X hours. As soon as they match you get the alert. After the alert fires the whole thing *resets*. The intent of that dampening rule is to avoid alerting on spikes in activity. Instead, alert only if a certain aberration is consistent.
For example, if "number of open files" sometimes spikes, you may not want an alert. But if, say, you are taking measurements every 10 minutes and 3 times in an hour you see the same spike, maybe there is a problem. In that case you'd say "3 Occurrences in 1 Hour".
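To make those semantics concrete, a minimal sketch of this kind of occurrence counting (illustrative only; not the actual RHQ code):

    # Sketch of "N Occurrences in X Hours" dampening as described above
    # (hypothetical Python; not the actual RHQ implementation).
    import time
    from collections import deque

    class OccurrenceDampening:
        def __init__(self, n, window_seconds):
            self.n = n
            self.window = window_seconds
            self.matches = deque()  # timestamps of condition matches

        def condition_matched(self, now=None):
            now = time.time() if now is None else now
            self.matches.append(now)
            while self.matches and now - self.matches[0] > self.window:
                self.matches.popleft()  # drop matches outside the window
            if len(self.matches) >= self.n:
                self.matches.clear()  # after the alert fires, everything resets
                return True           # fire the alert
            return False

    # With n=1 every single match fires immediately, which is why
    # "1 Occurrence in 1 Hour" suppresses nothing.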
What you want is "Fire no more than once in an hour". We don't really have that sort of dampening, although it may be a nice RFE for a time-based recovery.
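That behavior would instead be a rate limit on firings rather than on condition matches, roughly (again just a sketch of the idea, not an existing option):

    # Sketch of "fire no more than once per hour" -- a rate limit on firings,
    # which is a different mechanism (hypothetical; not an existing RHQ option).
    import time

    class FireRateLimit:
        def __init__(self, window_seconds=3600):
            self.window = window_seconds
            self.last_fired = None

        def should_fire(self, now=None):
            now = time.time() if now is None else now
            if self.last_fired is None or now - self.last_fired >= self.window:
                self.last_fired = now
                return True
            return False  # condition matched again, but stay quiet for now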
We have various dampening options, the ability to disable-after-fire, and recovery alerts. Perhaps you will find something that works for you; otherwise, feel free to create an RFE BZ with your exact requirement.
For more on dampening and recovery alerting see https://docs.jboss.org/author/display/RHQ/Alerts#Alerts-Dampening.
On 11/28/2013 3:21 AM, Attila Heidrich wrote: It happens to all alerts; it does not matter whether the alert is derived from a template or directly defined for a resource.
E.g., I had to define an alert for the postgres servers on the number of open files. It is defined as [Metric value threshold (Open file descriptors) > 10000]. Dampening settings: Time period: 1 Occurrence in 1 Hour.
I receive an alert every 10 minutes. I have also tried other dampening methods, but none worked for me.
Attila
2013/11/27 Jay Shaughnessy <jshaughn@redhat.com>
No, all of the metric conditions should use dampening just fine. Availability is a bit different because we only signal changes in availability rather than supplying a constant stream of availability data (this is for efficiency). But metrics are reported as scheduled, whether the value changes or not.
Does this happen only when the alert definition is derived from a template? If you define the exact same alert def at the resource level does the dampening work? That would be very strange but I'm just checking.
Can you post the exact alert def conditions and the dampening rule you have in effect?
rhq-users mailing list rhq-users@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/rhq-users
On 12/4/2013 3:28 PM, Jay Shaughnessy wrote:
Matt, I'm fairly sure you only need the two alert definitions:
1 - Email alert on Availability Down, disable when fired
2 - Recovery alert on Availability Up, enabling #1
That's true. But it would be really helpful to extend the system so that only one definition is needed.
In nearly all our use cases we have to define a sibling pair of alerts as described by Matt (but with a second email notification on #2, too). A single definition would reduce the complexity and error-proneness a lot! (Even Nagios has this built in.)
Having said all that, thanks for the suggestion, we're looking for ways to improve alerting and we know that dampening and recovery are two areas we can likely do better.
Nice to hear.
Elmar
Jay, good points. I realized my example using Availability was not a good one just after sending it; that condition is inherently one of state change. Attila's issue of (open files) > 10000 is a better example (I think).
If the condition goes true and stays true, you have to disable. When it goes false and stays false, you have to disable as well.
But point made. I'm happy as long as the idea might enter discussion.
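Roughly what I have in mind, as a sketch (hypothetical; not current RHQ behavior):

    # Sketch of the proposed "transition" dampening: notify only on state
    # changes, in either direction (hypothetical; not current RHQ behavior).
    def transition_alerts(samples, condition):
        previous = None
        for sample in samples:
            state = condition(sample)
            if previous is not None and state != previous:
                yield "ENTERED" if state else "EXITED"
            previous = state

    # list(transition_alerts([5, 15, 20, 8], lambda v: v > 10))
    # -> ["ENTERED", "EXITED"]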
On 12/4/13, 9:28 AM, "Jay Shaughnessy" jshaughn@redhat.com wrote:
Matt, I'm fairly sure you only need the two alert definitions:
1 - Email alert on Availability Down, disable when fired
2 - Recovery alert on Availability Up, enabling #1
Only one of the alert definitions will be active at a given time. #2 could also send e-mail as recovery alert definitions can still perform all notifications, I believe.
If you just want to be alerted on any Goes DOWN or Goes UP you could ignore recovery alerting completely and just define two standard alert defs that don't disable. Given the way availability works, you only get alerted on an availability *change* (state change) anyway, so dampening is actually not relevant. But, we have the Availability Duration conditions, which don't match unless the resource Goes DOWN and Stays DOWN for X Minutes, for example.
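The Availability Duration idea could be sketched like this (illustrative pseudologic, not the RHQ implementation):

    # Sketch of an Availability Duration style condition: match only if the
    # resource goes DOWN and stays DOWN for a given time (illustrative only).
    import time

    class StaysDownCondition:
        def __init__(self, duration_seconds):
            self.duration = duration_seconds
            self.down_since = None

        def on_availability(self, is_up, now=None):
            now = time.time() if now is None else now
            if is_up:
                self.down_since = None  # recovered before the duration elapsed
                return False
            if self.down_since is None:
                self.down_since = now   # went DOWN; start the clock
            return now - self.down_since >= self.duration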
Having said all that, thanks for the suggestion, we're looking for ways to improve alerting and we know that dampening and recovery are two areas we can likely do better.
rhq-users mailing list rhq-users@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/rhq-users