We have a somewhat unusual RHQ setup in which a single agent monitors
a large number of resources remotely. Per agent, we have about 25000
scheduled measurements, with about 1500 measurements collected per
minute. Since most of the metrics are collected with the same interval
(10 minutes), this causes the following problem: when the agent is
started (t=0), it schedules all these metrics in the same interval
[0s,30s]. However, because of the large number of measurements, the
agent is not able to collect all of them within that 30s interval and
reschedules the remaining ones to the next interval in the original
schedule, i.e. to [10m,10m+30s]. The same thing happens again in the
interval [10m,10m+30s]: most of the measurements are rescheduled to
the next interval [20m,20m+30s], and so forth. As a result, some
metrics are never collected (and are reported as "late" in the metrics
of the RHQ agent).
Note that the issue only occurs after restarting the agent. When the
resources are originally added to the inventory, the corresponding
measurement schedules are spread more or less randomly and the agent
is able to collect all of them.
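To illustrate the difference between the two situations, here is a
minimal sketch (not RHQ code) that counts how many measurements fall
due in each 30s window of one 10-minute interval. The class
SpreadSketch, its helper methods, the uniform random offsets and the
seed are all assumptions made for the illustration; the agent's real
initial spread is only "more or less random".

```java
import java.util.Random;

class SpreadSketch {
    static final long WINDOW_MS = 30_000L;

    /** Counts how many measurements fall due in each 30s window of one collection interval. */
    static int[] loadPerWindow(long[] firstCollectionMs, long intervalMs) {
        int[] load = new int[(int) (intervalMs / WINDOW_MS)];
        for (long t : firstCollectionMs) {
            load[(int) ((t % intervalMs) / WINDOW_MS)]++;
        }
        return load;
    }

    /** Returns the largest value in the array (the busiest window). */
    static int max(int[] values) {
        int best = 0;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    /** Hypothetical model of the initial spread: uniform offsets in [0, intervalMs). */
    static long[] randomOffsets(int count, long intervalMs, long seed) {
        Random random = new Random(seed);
        long[] offsets = new long[count];
        for (int i = 0; i < count; i++) {
            offsets[i] = (long) (random.nextDouble() * intervalMs);
        }
        return offsets;
    }

    public static void main(String[] args) {
        long intervalMs = 600_000L;            // 10-minute collection interval
        int count = 25_000;                    // scheduled measurements per agent
        long[] afterRestart = new long[count]; // all zeros: everything due at t=0
        long[] spread = randomOffsets(count, intervalMs, 42L);
        System.out.println("busiest window after restart: " + max(loadPerWindow(afterRestart, intervalMs)));
        System.out.println("busiest window when spread:   " + max(loadPerWindow(spread, intervalMs)));
    }
}
```

After a restart the busiest window holds all 25000 measurements,
whereas with a uniform spread each of the 20 windows holds roughly
1250 of them.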
To solve that issue with RHQ 3.0, I applied the following patch:
Index: src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java
===================================================================
--- src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java (revision 141630)
+++ src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java (revision 141631)
@@ -484,6 +484,13 @@
             this.scheduledRequests.offer(scheduledMeasurement);
         }
     }
+
+    public synchronized void reschedule(Set<ScheduledMeasurementInfo> scheduledMeasurementInfos, long interval) {
+        for (ScheduledMeasurementInfo scheduledMeasurement : scheduledMeasurementInfos) {
+            scheduledMeasurement.setNextCollection(scheduledMeasurement.getNextCollection() + interval);
+            this.scheduledRequests.offer(scheduledMeasurement);
+        }
+    }
     /**
      * Sends the given measurement report to the server, if this plugin container has server services that it can
Index: src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java
===================================================================
--- src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java (revision 141630)
+++ src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java (revision 141631)
@@ -71,7 +71,7 @@
                     log.debug("Measurement collection is falling behind... Missed requested time by ["
                         + (System.currentTimeMillis() - requests.iterator().next().getNextCollection()) + "ms]");
-                this.measurementManager.reschedule(requests);
+                this.measurementManager.reschedule(requests, 30000L);
                 return report;
             }
The idea is that instead of rescheduling the measurement according to
the original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should
simply be rescheduled to the next interval (from [0s,30s] to
[30s,60s]).
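The difference between the two rescheduling policies can be sketched
with a small stand-alone simulation. This is a toy model, not RHQ
code: the class RescheduleSketch, the fixed capacity per 30s window
and the tie-breaking by measurement id are assumptions chosen to make
the effect visible. All measurements are due at t=0; a late
measurement is pushed back either by the full collection interval
(original behaviour) or by 30s (patched behaviour).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class RescheduleSketch {
    static final long WINDOW_MS = 30_000L;

    /**
     * Returns how many measurements were never collected before horizonMs.
     * capacityPerWindow is a hypothetical limit on collections per 30s window;
     * lateDelayMs is the amount added to the next collection time of a late measurement.
     */
    static int neverCollected(int total, int capacityPerWindow, long collectionIntervalMs,
                              long lateDelayMs, long horizonMs) {
        // Entries are {nextCollection, id}; ties are broken by id (an assumption of this model).
        PriorityQueue<long[]> queue = new PriorityQueue<>(
            (a, b) -> a[0] != b[0] ? Long.compare(a[0], b[0]) : Long.compare(a[1], b[1]));
        boolean[] collected = new boolean[total];
        for (int id = 0; id < total; id++) {
            queue.add(new long[] {0L, id}); // after a restart everything is due at t=0
        }
        for (long windowStart = 0; windowStart < horizonMs; windowStart += WINDOW_MS) {
            int budget = capacityPerWindow;
            List<long[]> processed = new ArrayList<>();
            while (!queue.isEmpty() && queue.peek()[0] <= windowStart) {
                long[] m = queue.poll();
                if (budget > 0) {
                    budget--;
                    collected[(int) m[1]] = true;
                    m[0] += collectionIntervalMs; // collected: next run one interval later
                } else {
                    m[0] += lateDelayMs;          // late: push back by the policy's delay
                }
                processed.add(m);
            }
            queue.addAll(processed); // re-queue only after the window has been processed
        }
        int missing = 0;
        for (boolean c : collected) {
            if (!c) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        long interval = 600_000L; // 10 minutes
        long horizon = 1_800_000L; // simulate 30 minutes
        // Original policy: late measurements move to the next slot of the original schedule.
        System.out.println(neverCollected(1000, 300, interval, interval, horizon));  // prints 700
        // Patched policy: late measurements move to the next 30s window.
        System.out.println(neverCollected(1000, 300, interval, WINDOW_MS, horizon)); // prints 0
    }
}
```

In this model the original policy keeps collecting the same 300
measurements every 10 minutes, while the patched policy drains the
backlog over the first four 30s windows.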
We are currently in the process of upgrading to RHQ 4.4. I haven't
tested the patch with that version yet, but after looking at the code
I think it is still applicable. I would like to get some feedback on
the approach: is it a valid way to solve the issue, or are there
better ways to do it?
Andreas