Anytime I see odd metric/availability data like this, the first thing I look at is the clocks on the server and agent. You must ensure both server and agent have synchronized clocks. Use NTP to auto-sync machines to the same time.
I'm not sure what the problem could be if that isn't it. Try to uninventory that server again.
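As a rough illustration of the clock check suggested above, the sketch below estimates how far apart two machines' clocks are from a single request/reply exchange, using the standard NTP offset formula. The function names and sample timestamps are made up for illustration and are not part of RHQ; in practice you would simply run an NTP daemon on both the server and agent machines.

```python
def estimate_clock_offset(t0, t1, t2, t3):
    """Estimate how far the remote clock is ahead of the local clock (seconds).

    t0: local time when the request was sent
    t1: remote time when the request arrived
    t2: remote time when the reply was sent
    t3: local time when the reply arrived
    """
    # Standard NTP offset formula; assumes the network delay is roughly
    # the same in both directions.
    return ((t1 - t0) + (t2 - t3)) / 2.0


def round_trip_delay(t0, t1, t2, t3):
    """Network round-trip time, excluding the remote processing time."""
    return (t3 - t0) - (t2 - t1)


# Hypothetical numbers: remote clock runs 5 minutes ahead, 0.05 s each way.
offset = estimate_clock_offset(100.0, 400.05, 400.06, 100.11)
delay = round_trip_delay(100.0, 400.05, 400.06, 100.11)
```

An offset of several minutes between server and agent, as in these made-up numbers, is the kind of skew worth ruling out first.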
On 09/27/2010 08:17 AM, Alexey Kamenchuk wrote:
Hi
I have un-inventoried one of the JBoss servers which used to be hosted by the platform. A bit later I spotted that the platform and all of its servers appear as down, while all the metrics (of all the servers and services) are being collected correctly. Restarting the RHQ agent didn't help. The agent logs don't reveal any problem. Any idea how to cure this?
Regards Alexey
rhq-users mailing list rhq-users@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/rhq-users
I've seen something similar in the past when our RHQ server suffered a database outage. In our case a JBoss Server was marked down but all metrics were gathered. Eventually after a few hours the JBoss server was marked as up. In this cluster we had 33 other JBoss servers but they had correct availabilities.
I trawled through the logs but could not see anything obvious; the agent was reporting that it was sending availability information to the server.
Steve Millidge
Director
C2B2
Providing the foundations for Enterprise Scale Java.
T: 08450 539457
M: 07920 100626
W: http://www.c2b2.co.uk/
E: smillidge@c2b2.co.uk
From: rhq-users-bounces@lists.fedorahosted.org [mailto:rhq-users-bounces@lists.fedorahosted.org] On Behalf Of Alexey Kamenchuk
Sent: 27 September 2010 13:17
To: rhq-users@lists.fedorahosted.org
Subject: the whole platform appears as down while metrics are being successfully collected
Hi
I have un-inventoried one of the JBoss servers which used to be hosted by the platform. A bit later I spotted that the platform and all of its servers appear as down, while all the metrics (of all the servers and services) are being collected correctly. Restarting the RHQ agent didn't help. The agent logs don't reveal any problem. Any idea how to cure this?
Regards Alexey
Ah. Now that could be something. If the server is down (or the server otherwise fails to successfully process availability reports because, say, the DB is down), the agent's availability reports (and metric reports and other things) will fail to get persisted.
The metric reports do have guaranteed delivery - so if it fails to persist, the agent hangs on to those reports until such time that the server can successfully process and store them. Once the server is OK, the agent will resend its old metric reports so the old data isn't lost.
The same is NOT true for availability reports - they are not guaranteed (there are reasons for this - see bugzilla - there's an item in there that talks about this). So if an avail report fails to get processed, the agent just fires and forgets it - it'll send its next avail report and hope the server can process it. It will keep trying new reports but not send old avail reports.
The end effect would be that you'd see red availability but full metric data (as your graph showed).
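The difference between the two delivery modes described above can be sketched with a toy simulation. This is only an illustration under assumed names (`FlakyServer`, `Agent` and their methods are invented here), not RHQ's actual agent or server code: metric reports are spooled and retried, while a failed availability report is simply dropped.

```python
from collections import deque


class FlakyServer:
    """Stand-in for the server; rejects every report while `up` is False."""

    def __init__(self):
        self.up = True
        self.metrics = []          # persisted metric reports
        self.availability = None   # last persisted availability state

    def receive(self, kind, payload):
        if not self.up:
            raise ConnectionError("server cannot persist reports right now")
        if kind == "metric":
            self.metrics.append(payload)
        else:
            self.availability = payload


class Agent:
    def __init__(self, server):
        self.server = server
        self.pending_metrics = deque()  # spooled until the server accepts them

    def send_metric(self, value):
        # Guaranteed delivery: queue first, then try to drain the queue.
        self.pending_metrics.append(value)
        self.flush_metrics()

    def flush_metrics(self):
        while self.pending_metrics:
            try:
                self.server.receive("metric", self.pending_metrics[0])
            except ConnectionError:
                return  # keep everything queued for a later retry
            self.pending_metrics.popleft()

    def send_availability(self, state):
        # Fire and forget: a failed report is dropped, never resent.
        try:
            self.server.receive("avail", state)
        except ConnectionError:
            pass


server = FlakyServer()
agent = Agent(server)
agent.send_availability("DOWN")   # server records DOWN
server.up = False                 # outage begins
agent.send_metric(1)              # spooled by the agent
agent.send_availability("UP")     # lost during the outage
server.up = True                  # outage over
agent.send_metric(2)              # also delivers the spooled metric 1
```

At the end of this run the toy server holds both metric values but still shows the stale DOWN availability, which matches the "red availability but full metric data" picture.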
On 09/28/2010 05:07 AM, Steve Millidge wrote:
I’ve seen something similar in the past when our RHQ server suffered a database outage. In our case a JBoss Server was marked down but all metrics were gathered. Eventually after a few hours the JBoss server was marked as up. In this cluster we had 33 other JBoss servers but they had correct availabilities.
I trawled through the logs but could not see anything obvious; the agent was reporting that it was sending availability information to the server.
Hi John,
Possibly my issue also originated from the avail report persisting problem: the RHQ server went down and I had to restart it. After that I deleted a JBoss server from the inventory and spotted that the platform is down while metrics are still being collected.
But I waited for nearly 3 days and it still appears as down. So I deleted it and added it again. Was there a better way to fix the issue?
Regards Alexey
On 28.09.2010 16:18, John Mazzitelli wrote:
Ah. Now that could be something. If the server is down (or the server otherwise fails to successfully process availability reports because, say, the DB is down), the agent's availability reports (and metric reports and other things) will fail to get persisted.
The metric reports do have guaranteed delivery - so if it fails to persist, the agent hangs on to those reports until such time that the server can successfully process and store them. Once the server is OK, the agent will resend its old metric reports so the old data isn't lost.
The same is NOT true for availability reports - they are not guaranteed (there are reasons for this - see bugzilla - there's an item in there that talks about this). So if an avail report fails to get processed, the agent just fires and forgets it - it'll send its next avail report and hope the server can process it. It will keep trying new reports but not send old avail reports.
The end effect would be that you'd see red availability but full metric data (as your graph showed).
I've been seeing this as well lately if a server is bounced. Running "avail -f" on the agent clears it up by forcing a full avail report instead of changes only. I think this bug is relatively new.
-Greg
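One way to picture why a forced full report helps is the sketch below. It assumes, based only on the "changes only" wording above, that a normal report carries just the resources whose state changed since the agent's previous report; the function and variable names are hypothetical and this is not RHQ's implementation.

```python
def build_avail_report(current, last_sent, full=False):
    """Return the availability entries to send to the server.

    current / last_sent map resource name -> "UP" or "DOWN".
    With full=False only resources whose state changed since the last
    report are included; with full=True everything is included.
    """
    if full:
        return dict(current)
    return {name: state for name, state in current.items()
            if last_sent.get(name) != state}


# What the agent believes it last told the server.
agent_last_sent = {"platform": "UP", "jboss-1": "UP"}
# What the server actually has, e.g. after it missed a report.
server_view = {"platform": "DOWN", "jboss-1": "DOWN"}
current = {"platform": "UP", "jboss-1": "UP"}

changes_only = build_avail_report(current, agent_last_sent)
server_view.update(changes_only)
stale_view = dict(server_view)  # snapshot before the full report

full_report = build_avail_report(current, agent_last_sent, full=True)
server_view.update(full_report)
```

In this toy model the changes-only report is empty, so the server's stale DOWN entries never get corrected; the full report overwrites them with the current state.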
On Sep 28, 2010, at 8:35 AM, Alexey Kamenchuk ak@topdog.ru.net wrote:
Hi John,
Possibly my issue also originated from the avail report persisting problem: the RHQ server went down and I had to restart it. After that I deleted a JBoss server from the inventory and spotted that the platform is down while metrics are still being collected.
But I waited for nearly 3 days and it still appears as down. So I deleted it and added it again. Was there a better way to fix the issue?
Regards Alexey