We are attempting to do some performance testing of RHQ using agent copy and the perftest plugin on a two-server HA system. Our initial tests of 50 agents per server up to 200 per server went well: we saw the expected increase in network traffic and database inserts, and still captured all the data with no errors in the server logs. This past weekend we scaled the system to 400 agents per server, at which point things started to fall over. We started getting the following Postgres errors in the server log, and all our platforms were marked as not available.
2011-02-22 11:42:22,616 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 0, SQLState: null 2011-02-22 11:42:22,616 ERROR [org.hibernate.util.JDBCExceptionReporter] Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already); - nested throwable: (org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already)) 2011-02-22 11:42:22,617 INFO [org.rhq.enterprise.server.discovery.DiscoveryServerServiceImpl] Error processing availability report from [agentcopy-53046]: javax.ejb.EJBTransactionRolledbackException:org.hibernate.exception.GenericJDBCException: Cannot open connection -> javax.persistence.PersistenceException:org.hibernate.exception.GenericJDBCException: Cannot open connection -> org.hibernate.exception.GenericJDBCException:Cannot open connection -> org.jboss.util.NestedSQLException:Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already); - nested throwable: (org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already))[SQLException=Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already); - nested throwable: (org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already))] -> org.jboss.resource.JBossResourceException:Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already) -> org.postgresql.util.PSQLException:FATAL: sorry, too many clients already[SQLException=FATAL: sorry, too many clients already] 2011-02-22 11:42:22,715 WARN [org.jboss.resource.connectionmanager.JBossManagedConnectionPool] Throwable while attempting to get a new connection: null org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: sorry, too many clients already) at org.jboss.resource.adapter.jdbc.xa.XAManagedConnectionFactory.createManagedConnection(XAManagedConnectionFactory.java:155) at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.createConnectionEventListener(InternalManagedConnectionPool.java:619) at org.jboss.resource.connectionmanager.InternalManagedConnectionPool.getConnection(InternalManagedConnectionPool.java:264) at org.jboss.resource.connectionmanager.JBossManagedConnectionPool$BasePool.getConnection(JBossManagedConnectionPool.java:613) at org.jboss.resource.connectionmanager.BaseConnectionManager2.getManagedConnection(BaseConnectionManager2.java:347) at org.jboss.resource.connectionmanager.TxConnectionManager.getManagedConnection(TxConnectionManager.java:330) at org.jboss.resource.connectionmanager.BaseConnectionManager2.allocateConnection(BaseConnectionManager2.java:402) at org.jboss.resource.connectionmanager.BaseConnectionManager2$ConnectionManagerProxy.allocateConnection(BaseConnectionManager2.java:849) at org.jboss.resource.adapter.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:89) at org.hibernate.ejb.connection.InjectedDataSourceConnectionProvider.getConnection(InjectedDataSourceConnectionProvider.java:47) at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:423) at 
org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:144) at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:140) at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1547) at org.hibernate.loader.Loader.doQuery(Loader.java:673) at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236) at org.hibernate.loader.Loader.doList(Loader.java:2213) at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2104) at org.hibernate.loader.Loader.list(Loader.java:2099) at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:378) at org.hibernate.hql.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:338) at org.hibernate.engine.query.HQLQueryPlan.performList(HQLQueryPlan.java:172) at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1121) at org.hibernate.impl.QueryImpl.list(QueryImpl.java:79) at org.hibernate.ejb.QueryImpl.getSingleResult(QueryImpl.java:80) at org.rhq.enterprise.server.core.AgentManagerBean.getAgentByAgentToken(AgentManagerBean.java:306) at sun.reflect.GeneratedMethodAccessor261.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:112) at org.jboss.ejb3.interceptor.InvocationContextImpl.proceed(InvocationContextImpl.java:166) at org.jboss.ejb3.interceptor.EJB3InterceptorsInterceptor.invoke(EJB3InterceptorsInterceptor.java:63) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.entity.TransactionScopedEntityManagerInterceptor.invoke(TransactionScopedEntityManagerInterceptor.java:54) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.AllowedOperationsInterceptor.invoke(AllowedOperationsInterceptor.java:47) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.aspects.tx.TxPolicy.invokeInOurTx(TxPolicy.java:79) at org.jboss.aspects.tx.TxInterceptor$Required.invoke(TxInterceptor.java:191) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.aspects.tx.TxPropagationInterceptor.invoke(TxPropagationInterceptor.java:95) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.stateless.StatelessInstanceInterceptor.invoke(StatelessInstanceInterceptor.java:62) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.aspects.security.AuthenticationInterceptor.invoke(AuthenticationInterceptor.java:77) at org.jboss.ejb3.security.Ejb3AuthenticationInterceptor.invoke(Ejb3AuthenticationInterceptor.java:110) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.ENCPropagationInterceptor.invoke(ENCPropagationInterceptor.java:46) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.asynchronous.AsynchronousInterceptor.invoke(AsynchronousInterceptor.java:106) at org.jboss.aop.joinpoint.MethodInvocation.invokeNext(MethodInvocation.java:101) at org.jboss.ejb3.stateless.StatelessContainer.localInvoke(StatelessContainer.java:240) at org.jboss.ejb3.stateless.StatelessContainer.localInvoke(StatelessContainer.java:210) at org.jboss.ejb3.stateless.StatelessLocalProxy.invoke(StatelessLocalProxy.java:84) at $Proxy263.getAgentByAgentToken(Unknown Source) at 
org.rhq.enterprise.server.core.comm.SecurityTokenCommandAuthenticator.isAuthenticated(SecurityTokenCommandAuthenticator.java:96) at org.rhq.enterprise.communications.command.server.CommandProcessor.handleIncomingInvocationRequest(CommandProcessor.java:246) at org.rhq.enterprise.communications.command.server.CommandProcessor.invoke(CommandProcessor.java:184) at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:809) at org.jboss.remoting.transport.servlet.ServletServerInvoker.processRequest(ServletServerInvoker.java:232) at sun.reflect.GeneratedMethodAccessor202.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:155) at org.jboss.mx.server.Invocation.dispatch(Invocation.java:94) at org.jboss.mx.server.Invocation.invoke(Invocation.java:86) at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264) at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659) at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:288) at $Proxy421.processRequest(Unknown Source) at org.jboss.remoting.transport.servlet.web.ServerInvokerServlet.processRequest(ServerInvokerServlet.java:128) at org.jboss.remoting.transport.servlet.web.ServerInvokerServlet.doPost(ServerInvokerServlet.java:157) at javax.servlet.http.HttpServlet.service(HttpServlet.java:710) at javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.rhq.helpers.rtfilter.filter.RtFilter.doFilter(RtFilter.java:124) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:182) at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446) at java.lang.Thread.run(Thread.java:619) Caused by: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already at 
org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:464) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:112) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66) at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125) at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30) at org.postgresql.jdbc3.Jdbc3Connection.<init>(Jdbc3Connection.java:24) at org.postgresql.Driver.makeConnection(Driver.java:393) at org.postgresql.Driver.connect(Driver.java:267) at java.sql.DriverManager.getConnection(DriverManager.java:582) at java.sql.DriverManager.getConnection(DriverManager.java:185) at org.postgresql.ds.common.BaseDataSource.getConnection(BaseDataSource.java:87) at org.postgresql.xa.PGXADataSource.getXAConnection(PGXADataSource.java:47) at org.jboss.resource.adapter.jdbc.xa.XAManagedConnectionFactory.createManagedConnection(XAManagedConnectionFactory.java:137) ... 94 more
At this point I figured we had exceeded the total number of available connections to the database, so I increased max_connections in postgresql.conf from 60 to 120. This seemed to work for about an hour, but now I'm seeing a different SQL error in the server log:
2011-02-22 14:38:36,617 WARN [org.hibernate.util.JDBCExceptionReporter] SQL Error: 0, SQLState: null 2011-02-22 14:38:36,617 ERROR [org.hibernate.util.JDBCExceptionReporter] Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: connection limit exceeded for non-superusers); - nested throwable: (org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: connection limit exceeded for non-superusers)) 2011-02-22 14:38:37,911 WARN [org.rhq.enterprise.communications.command.server.CommandProcessor] {CommandProcessor.failed-authentication}Command failed to be authenticated! This command will be ignored and not processed: Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.guaranteed-delivery=true, rhq.agent-name=agentcopy-52019, rhq.security-token=qx2pzmh7AX/RILHfKQCOuEq6ILOAtLQSRefrX0SzpRrLw541w1ey3ZJIr3onEd2XOlc=, rhq.failover-attempts=4, rhq.externalizable-strategy=AGENT, rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeMeasurementReport], targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService}] 2011-02-22 14:38:38,472 INFO [org.rhq.enterprise.server.core.CoreServerServiceImpl] Agent [agentcopy-40003][3.0.0(59e9341)] would like to connect to this server 2011-02-22 14:38:38,484 WARN [org.jboss.resource.connectionmanager.JBossManagedConnectionPool] Throwable while attempting to get a new connection: null org.jboss.resource.JBossResourceException: Could not create connection; - nested throwable: (org.postgresql.util.PSQLException: FATAL: connection limit exceeded for non-superusers)
I understand these are warnings, probably happening because all 800 agents suddenly saw an available server, but shouldn't the server's global concurrency limit prevent the excess requests from the agents from being attempted? These are already-inventoried agents trying to synchronize inventories, send availability reports, etc. Are there any tuning parameters I can tweak to make this cleaner? I'm letting the system run, hoping that it will return to a steady state eventually. Have we just exceeded the number of agents per server that the system can comfortably handle, and should we scale back to 50-100?
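For reference, this is what I've been running on the database side to see where we stand (assuming a stock Postgres setup; as far as I can tell the "connection limit exceeded for non-superusers" error means the non-reserved slots - max_connections minus superuser_reserved_connections, which defaults to 3 - ran out, unless a per-role or per-database CONNECTION LIMIT is set):

    SHOW max_connections;
    SHOW superuser_reserved_connections;
    -- how many connections are currently open vs. the limit
    SELECT count(*) FROM pg_stat_activity;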
Bala Nair SeaChange International
Depending on the backend, 400 agents to a single server may be too much. That said, I remember about a year ago someone tested using a database with SSD storage and it was blazing fast and probably could have handled more than 400 agents per server. It (mainly) depends on the database - how well it is tuned and how fast the disks are. The number of managed resources also has an impact - if your 800 agents are monitoring, say, only the operating system and its child services, it wouldn't be as bad as if all 800 were each monitoring a couple of JBossAS instances (the difference is the number of measurement reports each agent sends up to the server).
Anyway, aside from that, the global concurrency limits should help. But they only delay the agents' reports getting into the server. Once a concurrency limit is hit, the agents are told to "hold" for a few seconds (up to, I think, a minute or so) and then retry the message; if there is room, the message goes through (otherwise the concurrency limit kicks in again and the agent waits again). It's essentially a throttling mechanism.
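If memory serves, the knobs for this live in rhq-server.properties - the property names below are from memory, so verify them against the Communications Configuration docs before relying on them:

    # global cap on concurrent incoming agent messages (illustrative values, not recommendations)
    rhq.communications.global-concurrency-limit=30
    # per-message-type caps
    rhq.server.concurrency-limit.inventory-report=5
    rhq.server.concurrency-limit.availability-report=25
    rhq.server.concurrency-limit.inventory-sync=10
    rhq.server.concurrency-limit.measurement-report=10
    rhq.server.concurrency-limit.measurement-schedule-request=10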
If you are starting all 800 of your agents at roughly the same time (with the agentspawn stuff you normally launch them about 1m apart), the servers will get clobbered with all of their sync requests and initial inventory reports. This causes immense traffic (which, by the way, is not typical of real production, since it is rare that you would start 800 machines/agents all at roughly the same time or within minutes of each other). That traffic will cause the concurrency limits to kick in, and you probably have to sit and wait for things to calm down into a steady state.
Check your server logs and see which concurrency limits are getting triggered. They come out at the DEBUG level; the jboss-log4j.xml categories are:
<!-- details on incoming remote pojo invocations; emits concurrency limit debug messages -->
<!--
<category name="org.rhq.enterprise.communications.command.impl.remotepojo.server.RemotePojoInvocationCommandService">
   <priority value="DEBUG"/>
</category>
-->

<!-- emits global concurrency limit debug messages -->
<!--
<category name="org.rhq.enterprise.communications.GlobalConcurrencyLimitCommandListener">
   <priority value="DEBUG"/>
</category>
-->
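(To enable them, uncomment both categories; on a stock RHQ 3.x server the file should live under <rhq-server>/jbossas/server/default/conf/jboss-log4j.xml, though double-check the path for your install.)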
All that said, I think you are hitting your upper limit on the number of agents per server. It would not surprise me if you simply need to increase the number of servers in your HA setup.
As I'm sure you know, the global concurrency limits only limit the number of agent requests coming into a server, not the number of connections the server makes to the DB. IIRC, when testing this on Oracle we were seeing the number of db connections used exceed the number of agents. You've also got all the background jobs and the UI itself using DB connections. So if your database server can take it, try doubling max_connections.
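As a rough sizing sketch (the numbers are illustrative, not taken from your install): the database has to accommodate every server's JDBC pool, plus the reserved superuser slots and any ad-hoc psql sessions. Something like:

    pool_per_server = 100     -- the <max-pool-size> in each server's datasource config (rhq-ds.xml, IIRC); example value
    required_slots  = pool_per_server * number_of_servers     -- 200 for a two-node setup
    max_connections >= required_slots + superuser_reserved_connections + a little headroom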
Mazz also pointed out docs on this: http://rhq-project.org/display/JOPR2/Communications+Configuration#Communicat...
If you do add another JON node, you'll definitely need to up max_connections, because each additional node brings another full connection pool's worth of potential connections to the db.
As mazz mentioned, you can stress the db in several directions:
i) lots of agents, few metrics => lots of connections
ii) few agents, lots of metrics => lots of work in the db
iii) lots of agents, lots of metrics => lots of connections and lots of db work
Each of these will manifest itself in a different way. It may be helpful to run "select * from pg_stat_activity;" if you hit this situation again and see what the various connections are up to.
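For example, something along these lines (the column names are from the 8.x/9.0-era pg_stat_activity; later Postgres versions renamed procpid/current_query to pid/query):

    -- how many connections each database/user has open, and how many are waiting on locks
    SELECT datname, usename, waiting, count(*)
      FROM pg_stat_activity
     GROUP BY datname, usename, waiting
     ORDER BY count(*) DESC;

    -- what the non-idle connections are actually running
    SELECT procpid, query_start, current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
     ORDER BY query_start;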
Let us know how you get on.

Cheers
Charles
Things seem to have settled down - we doubled max_connections in postgresql.conf and started the agents up more slowly. We still saw the connection-limit-exceeded error, but it stopped after a while and we seem to be operating normally again. So we have effectively run scenario (i) from your list, and 200 agents per server is probably the max we want to run with. Next is to test scenario (iii) - my plan is to model our plugins with perftest scenarios that mimic the number and frequency of data collection of our real plugins.
Thanks for all the help.
Bala Nair SeaChange International