[I'm sending this to the rhq-users list - it is more appropriate there - the rhq-devel list is for developers of the RHQ codebase]
First, why is it trying to fail over to localhost instead of server B, and second, why the connection refused error? There is no RHQ server on this agent box to refuse a connection.
I suspect that this is because your server B might not have its public endpoint declared correctly. Go to Administration > Servers and look at Server B. What is its public endpoint address? Make sure it is correct.
On your agents, what do the failover lists look like? I suspect you will see "127.0.0.1" in the failover list, as opposed to the server B host/IP you expect.
Read this as background:
http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-Serv...
http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-Fail...
-------- Original Message --------
Subject: can't get ha failover to work
Date: Fri, 15 Oct 2010 12:39:25 -0400
From: Bala Nair <bnairtm@comcast.net>
Reply-To: rhq-devel@lists.fedorahosted.org
To: rhq-devel@lists.fedorahosted.org
We're trying to set up an RHQ HA cloud with 2 servers and 4 agents, and we're having a problem getting the agents to fail over to the second server. When we first start everything up, all the agents are connected to one server (call it server A), with the other server (server B) not connected to any agents. The failover list on the agent side shows 2 entries (server B and server A, in that order). We go to the HA servers page in the GUI and see that both servers are in NORMAL mode, with server A having an agent count of 4 and server B a count of 0. There are no affinity groups. We then set server A to MAINTENANCE mode and wait. I expect the 4 agents connected to server A to fail over to server B, and to see that in the servers list, but nothing changes. Checking the agent logs, I find the following errors:
2010-10-15 11:17:46,653 INFO [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
2010-10-15 11:17:46,654 WARN [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker. Connection refused.
2010-10-15 11:17:46,658 INFO [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
2010-10-15 11:17:46,661 WARN [RHQ Server Polling Thread] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.rhq.enterprise.communications.util.NotProcessedException
2010-10-15 11:17:46,663 WARN [RHQ Server Polling Thread] (org.rhq.enterprise.agent.FailoverFailureCallback)- {AgentMain.too-many-failover-attempts}Too many failover attempts have been made [2]. Exception that triggered the failover: [org.rhq.enterprise.communications.util.NotProcessedException]
2010-10-15 11:17:46,663 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: org.rhq.enterprise.communications.util.NotProcessedException:null. Cause: org.rhq.enterprise.communications.util.NotProcessedException
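To see at a glance where the agent tried to go, the "changing endpoint" messages above can be pulled out of the log with a small script. This is only a sketch: the regular expression is written against the message wording shown in this log, and the names ENDPOINT_CHANGE, endpoint_changes and failover_targets are made up for illustration, not part of RHQ.

```python
import re
from urllib.parse import urlsplit

# Pattern written against the message text in the log above (an assumption:
# other RHQ versions may word this message differently).
ENDPOINT_CHANGE = re.compile(
    r"changing endpoint from \[InvokerLocator \[(?P<old>[^\]]+)\]\]\s+"
    r"to \[InvokerLocator \[(?P<new>[^\]]+)\]\]"
)


def endpoint_changes(log_text):
    """Return (old, new) endpoint URLs in the order they appear in the log."""
    return [(m.group("old"), m.group("new")) for m in ENDPOINT_CHANGE.finditer(log_text)]


def failover_targets(log_text):
    """Return (host, port) of each endpoint the agent switched to."""
    targets = []
    for _old, new in endpoint_changes(log_text):
        parts = urlsplit(new)
        targets.append((parts.hostname, parts.port))
    return targets


# Message text copied from the two "changing endpoint" entries above.
SAMPLE = (
    "Communicator is changing endpoint from "
    "[InvokerLocator [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]\n"
    "to [InvokerLocator [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]\n"
    "Communicator is changing endpoint from "
    "[InvokerLocator [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]] "
    "to [InvokerLocator [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]\n"
)

if __name__ == "__main__":
    for host, port in failover_targets(SAMPLE):
        print(host, port)
```

On the sample, the first target comes back as 127.0.0.1 with no port at all, which is the entry that never should have been tried; the second is the fail-back to mmc-int on 7080.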
In this case mmc-int is server A. I can understand the second series of errors, where it tries to fail back to mmc-int and fails because mmc-int is in maintenance mode. I don't understand the initial failure, though. First, why is it trying to fail over to localhost instead of server B, and second, why the connection refused error? There is no RHQ server on this agent box to refuse a connection.
I have looked through all the agent and server configuration properties and I just don't see how the localhost address is getting set in this case. Any help would be appreciated. Thanks.
Bala Nair SeaChange International
_______________________________________________ rhq-devel mailing list rhq-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/rhq-devel
So, I think we found the problem. Server B's public endpoint address was mmc_int2, and the agent machine's /etc/hosts file mapped that name to the correct IP address. We changed the endpoint to the IP address of the server and restarted the agent. Failover started working correctly. We then changed the endpoint to mmc-int2, changed the hosts file on the agent box, and restarted the agent. Once again everything worked fine. So I switched everything back to mmc_int2 just to verify that it was a name issue, and failover stopped working again. So it looks like an underscore in an endpoint name causes connection failure. Is there something about the naming rules that I'm just not aware of, or is this really a bug?
Bala Nair SeaChange International
On 10/15/10 12:45 PM, John Mazzitelli wrote:
[...]
http://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names
"While a hostname may not contain other characters, such as the underscore character (_)..."
"A notable example of non-compliance with this specification, Microsoft Windows systems often use underscores in hostnames. Since some systems will reject invalid hostnames while others will not, the use of invalid hostname characters may cause subtle problems in systems that connect to standards-based services. For example, RFC-compliant mail servers will refuse to deliver mail for MS Windows computers with names containing underscores."
"The Internet standards (Request for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner), the digits '0' through '9', and the hyphen ('-'). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. *No other symbols, punctuation characters, or white space are permitted.*" (emphasis added)
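The rules quoted above are easy to check before a name goes into a public endpoint address or a hosts file. The following is a minimal sketch, not anything RHQ itself runs: the names LABEL, invalid_labels and is_valid_hostname are invented for this example, and the 63-character label limit and 253-character overall limit are the usual DNS limits rather than something stated in the quoted text.

```python
import re

# One hostname label per the quoted rules: ASCII letters, digits and hyphens,
# not starting or ending with a hyphen. RFC 1123 allows a leading digit.
# The 1-63 length bound is the standard DNS label limit (an addition here).
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def invalid_labels(hostname):
    """Return the dot-separated labels of hostname that break the rules."""
    labels = hostname.rstrip(".").split(".")
    return [label for label in labels if not LABEL.fullmatch(label)]


def is_valid_hostname(hostname):
    """True if every label is valid and the whole name fits in 253 characters."""
    return 0 < len(hostname) <= 253 and not invalid_labels(hostname)


if __name__ == "__main__":
    for name in ("mmc-int", "mmc-int2", "mmc_int2"):
        print(name, "ok" if is_valid_hostname(name) else "invalid")
```

With the names from this thread, mmc-int and mmc-int2 pass and mmc_int2 is flagged because of the underscore. A check like this only says whether a name follows the standard; which component actually rejects a non-compliant name is not established in this thread.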
On 10/15/2010 02:13 PM, Bala Nair wrote:
[...]
rhq-users mailing list rhq-users@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/rhq-users
Doh!!
On 10/15/10 2:19 PM, John Mazzitelli wrote:
[...]