Postgres database stopped again... I guess the problem was the enormous number of open files...
We use 9.2, and we also use the postgres plugin - which still doesn't really support 9.2 as far as I know.
Altogether, Postgres (only the instance storing the RHQ database) can run for just a few days; then the number of open files and open connections rises really high, and finally I have to restart it.
Usually I restart RHQ as well, since I generally have no time to play with "what to restart".
The log is practically endless. I can quote from it, but I would need to know what to look for.
Regards,
Attila
Attila,
On 23.09.2013 at 10:52, Attila Heidrich wrote:
> We use 9.2, and we also use the postgres plugin - which still doesn't really support 9.2 as far as I know.
Can you please disable that plugin to see if it may be causing the issue? To what value did you set max_connections in postgresql.conf?
max_connections = 100
John found an issue recently where the baseline calculation did not close connections, but this should not create insane amounts of connections, as the appserver is supposed to limit them.
Sorry for the inconvenience.
If that happens again, can you run ps auxw and grep for processes like these:
postgres 6087 0,0 0,1 2487132 5312 ?? Ss 11:28am 0:00.08 postgres: rhqadmin rhqdev 127.0.0.1(65248) idle
postgres 6081 0,0 0,1 2487132 4308 ?? Ss 11:28am 0:00.01 postgres: rhqadmin rhqdev 127.0.0.1(65243) idle
That may help us diagnose what is going wrong.
Heiko
I have "uninventoried" the postgres plugin for most of the servers. By the way, only a single server suffers from this issue, and it is the one used by RHQ itself! The others, which are only monitored by the agent, are quite fine. The only problem with the 9.2 version is that some system tables changed in 9.2, so there is no current_query column anymore.
It is strange that the postgres log still contains lines like these:
2013-09-23 13:43:04 CEST ERROR: column "current_query" does not exist at character 70
2013-09-23 13:43:04 CEST STATEMENT: SELECT (SELECT COUNT(*) FROM pg_stat_activity where usename = $1 AND current_query != '<IDLE>') AS active, (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $2) AS total
2013-09-23 13:43:44 CEST FATAL: password authentication failed for user "postgres"
2013-09-23 13:43:46 CEST FATAL: password authentication failed for user "postgres"
I guess uninventory is not the perfect way to disable the plugin...
max_connections is at the default of 100 on all the servers.
Whenever it happens again, I will collect more Pg statistics!
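A note on the ERROR lines above: PostgreSQL 9.2 renamed the pg_stat_activity column current_query to query and added a separate state column, so the '<IDLE>' marker no longer exists there. The following is only a minimal sketch of a version-aware form of the failing statement. activity_query is a hypothetical helper name, 90200 is the server_version_num value for 9.2.0, and counting every state other than 'idle' as active is an assumption about what the plugin intends; this is not the actual RHQ fix.

```shell
#!/bin/sh
# Sketch only: print an active/total connection query that matches the server version.
# $1 is the value of "SHOW server_version_num" (for example 90204 for 9.2.4).
# activity_query is a hypothetical name, not part of RHQ or PostgreSQL.
activity_query() {
  if [ "$1" -ge 90200 ]; then
    # 9.2 and later: use the "state" column instead of current_query
    cat <<'SQL'
SELECT (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $1 AND state <> 'idle') AS active,
       (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $2) AS total
SQL
  else
    # before 9.2: the statement as it appears in the log
    cat <<'SQL'
SELECT (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $1 AND current_query != '<IDLE>') AS active,
       (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $2) AS total
SQL
  fi
}
```

The $1 and $2 placeholders inside the SQL are kept from the logged prepared statement; to try the query by hand in psql they would have to be replaced with a quoted user name.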
Regards,
Attila
And just one more interesting thing:
After I corrected the agent's authentication, so that it can reach the system tables and report everything, I still get lines like this:
2013-09-23 14:28:56 CEST FATAL: password authentication failed for user "postgres"
Until I set up the access parameters I got only such lines, but after setting them up I get this:
2013-09-23 14:28:17 CEST ERROR: column "current_query" does not exist at character 70
2013-09-23 14:28:17 CEST STATEMENT: SELECT (SELECT COUNT(*) FROM pg_stat_activity where usename = $1 AND current_query != '<IDLE>') AS active, (SELECT COUNT(*) FROM pg_stat_activity WHERE usename = $2) AS total
and, less frequently, this:
2013-09-23 14:28:56 CEST FATAL: password authentication failed for user "postgres"
Attila
OS is Debian Wheezy (64bit), Pg is 9.2.4-1.pgdg70+1, RHQ is 4.9.
Other hosts run the very same versions, and the agent-only configurations have no problem at all (apart from the agent's problem with Pg 9.2). There are also other Linux versions and a few Windows configurations (XP and 2008 Server); most of the platforms run Pg as well.
Attila
On 24/09/2013 at 12:46, Attila Heidrich wrote:
> OS is Debian Wheezy (64bit), Pg is 9.2.4-1.pgdg70+1, RHQ is 4.9.
> Other hosts run the very same versions, and the agent-only configurations have no problem at all (apart from the agent's problem with Pg 9.2).
rhq-users mailing list rhq-users@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/rhq-users
I was talking about the platform running your RHQ database, not the platforms running managed Postgres servers.
What value does "ulimit -n" report?
It is very easy to reach 10 000 open files (e.g. if you have 20 open connections and each has 500 open files).
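The arithmetic behind that figure can be sketched as below. estimate_open_files is a hypothetical helper name, and the result is only a rough upper bound: it assumes every backend holds the same number of files, which real backends will not do exactly.

```shell
#!/bin/sh
# Sketch only: rough upper bound on files held open by Postgres backends.
# $1 = number of backend connections, $2 = open files per backend.
# estimate_open_files is a hypothetical name used for illustration.
estimate_open_files() {
  connections=$1
  files_per_backend=$2
  echo $((connections * files_per_backend))
}

# The example from the message: 20 connections with 500 files each.
estimate_open_files 20 500     # prints 10000
```

With the values that come up in this thread (max_connections = 100 and max_files_per_process at its default of 1000), the same calculation gives a theoretical ceiling of 100000, far above a per-process "ulimit -n" of 1024, although the two numbers measure different things (a server-wide total versus a per-process cap).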
Thomas
The version data I sent belongs to the server which runs both the RHQ server and the RHQ database.
I understand it is easy to reach 10K open files, and I can accept that if it is normal, but what are the recommended settings then? Other Pg servers, which don't run the RHQ server, work for a long time with a higher "active backend" count and a much smaller "open files" count!
root@ct-front:/etc/postgresql/9.2/main# ulimit -n
1024
Attila
Currently I have to restart the RHQ server twice a day with a cron job. Can I restart the server itself with a script triggered by some alert?
Attila
Hi Attila,
Sorry for the late reply.
On 27/09/2013 at 10:34, Attila Heidrich wrote:
> Currently I have to restart the RHQ server twice a day with a cron job. Can I restart the server itself with a script triggered by some alert?
Most probably. You said in a previous email that open files prevented the server from running, but did you see any error messages in the RHQ server or Postgres logs?
You might be affected by Bug 1009640 - "JDBC connections leaked during baseline calculations". I can help you get RHQ 4.9 rebuilt with the necessary patch.
> 2013/9/24 Attila Heidrich <attila.heidrich@gmail.com>
>> root@ct-front:/etc/postgresql/9.2/main# ulimit -n
>> 1024
I would recommend setting this to 2048. Can you also check the system-wide open file limit ("sysctl -n fs.file-max")?
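One caveat worth checking alongside these values: "ulimit -n" run in a root shell reports the limit of that shell, which is not necessarily the limit the running Postgres processes were started with. On Linux the limits of a running process can be read from /proc/<pid>/limits. The sketch below only parses that file's "Max open files" row; nofile_limits is a hypothetical helper name, and the pid has to be supplied by whoever runs it.

```shell
#!/bin/sh
# Sketch only: print the soft and hard "Max open files" limits found in the
# text of a /proc/<pid>/limits file supplied on stdin.
# nofile_limits is a hypothetical name used for illustration.
nofile_limits() {
  awk '$1 == "Max" && $2 == "open" && $3 == "files" { print $4, $5 }'
}

# Usage (Linux), with $pid set to a Postgres process id taken from ps:
#   nofile_limits < "/proc/$pid/limits"
```

The first number printed is the soft limit, the one the process actually runs into; the second is the hard limit up to which it could be raised.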
If your server stops working again, can you run this command and share the results?
ps --no-headers -f -U postgres | awk '{print $2}' | while read -r pid; do
  echo "Postgres process:"
  ps --no-headers -f -p "$pid"
  echo "Open files:"
  lsof -p "$pid" | wc -l
  echo
  echo
done
Regards, Thomas
Running since yesterday morning. I have enabled the postgres plugins, since the postgres log showed that the agent was still active after "uninventory", but with wrong credentials.
I had to restart the rhq-server (only the server, neither Pg nor the agent) this morning, since the number of open files rose above 10K.
The processes before restart:
root@ct-front:~# ps auxw|grep postgres|grep rhq
postgres 1631 0.0 0.1 121300 44740 ? Ss Sep23 0:05 postgres: rhq rhq 127.0.0.1(46903) idle
postgres 1735 0.0 0.1 118792 40384 ? Ss Sep23 0:09 postgres: rhq rhq 127.0.0.1(36888) idle
postgres 3669 0.4 0.1 118572 29676 ? Ss 08:40 0:00 postgres: rhq rhq 127.0.0.1(54749) idle
postgres 3670 0.3 0.1 119164 27232 ? Ss 08:40 0:00 postgres: rhq rhq 127.0.0.1(54750) idle
postgres 5669 0.0 0.0 102028 5644 ? Ss Sep23 0:00 postgres: rhq rhq 127.0.0.1(56115) idle
postgres 5680 0.0 0.1 115072 39132 ? Ss Sep23 0:05 postgres: rhq rhq 127.0.0.1(56125) idle
postgres 6025 0.0 0.0 103304 10648 ? Ss Sep23 0:27 postgres: rhq rhq 127.0.0.1(56210) idle
postgres 6041 0.0 0.1 123280 48024 ? Ss Sep23 0:11 postgres: rhq rhq 127.0.0.1(56215) idle
postgres 6060 0.0 0.1 119700 42168 ? Ss Sep23 0:17 postgres: rhq rhq 127.0.0.1(56237) idle
postgres 7387 0.0 0.0 106948 16364 ? Ss Sep23 1:01 postgres: postgres rhq 127.0.0.1(60327) idle
postgres 8535 0.0 0.1 114120 36108 ? Ss 02:51 0:04 postgres: rhq rhq 127.0.0.1(48651) idle
postgres 11169 0.0 0.1 118580 39900 ? Ss 00:00 0:07 postgres: rhq rhq 127.0.0.1(59908) idle
postgres 11409 0.0 0.1 123716 46292 ? Ss Sep23 0:11 postgres: rhq rhq 127.0.0.1(46198) idle
postgres 11431 0.0 0.2 124624 52432 ? Ss Sep23 0:47 postgres: rhq rhq 127.0.0.1(46289) idle
postgres 11568 0.0 0.1 116364 40484 ? Ss Sep23 0:05 postgres: rhq rhq 127.0.0.1(34407) idle
postgres 12416 0.0 0.1 122788 49160 ? Ss Sep23 0:17 postgres: rhq rhq 127.0.0.1(52744) idle
postgres 15707 0.0 0.0 103324 10892 ? Ss Sep23 0:25 postgres: rhq rhq 127.0.0.1(60936) idle
postgres 19382 0.0 0.1 122916 48488 ? Ss Sep23 0:08 postgres: rhq rhq 127.0.0.1(56676) idle
postgres 19868 0.1 0.1 118104 38276 ? Ss 07:01 0:09 postgres: rhq rhq 127.0.0.1(45391) idle
postgres 20554 0.0 0.1 118244 40764 ? Ss 04:00 0:08 postgres: rhq rhq 127.0.0.1(55602) idle
postgres 21534 0.0 0.2 125260 50348 ? Ss 01:00 0:12 postgres: rhq rhq 127.0.0.1(37533) idle
postgres 22480 0.0 0.1 105776 30264 ? Ss Sep23 0:03 postgres: rhq rhq 127.0.0.1(58453) idle
postgres 23255 0.0 0.1 114344 43000 ? Ss Sep23 0:28 postgres: rhq rhq 127.0.0.1(52114) idle
postgres 23435 0.1 0.1 120388 44036 ? Ss 07:21 0:05 postgres: rhq rhq 127.0.0.1(47287) idle
postgres 24113 0.0 0.1 123212 48060 ? Ss 04:21 0:05 postgres: rhq rhq 127.0.0.1(57681) idle
postgres 25861 0.0 0.1 119720 43260 ? Ss 01:25 0:05 postgres: rhq rhq 127.0.0.1(40007) idle
postgres 27396 0.0 0.1 124076 48696 ? Ss 04:40 0:08 postgres: rhq rhq 127.0.0.1(59573) idle
postgres 28973 0.0 0.1 116924 39768 ? Ss Sep23 0:05 postgres: rhq rhq 127.0.0.1(43984) idle
postgres 30394 0.2 0.1 125260 39752 ? Ss 08:01 0:06 postgres: rhq rhq 127.0.0.1(51100) idle
postgres 30455 0.0 0.2 123336 50372 ? Ss Sep23 0:15 postgres: rhq rhq 127.0.0.1(34681) idle
postgres 30576 0.2 0.1 124888 39200 ? Ss 08:01 0:06 postgres: rhq rhq 127.0.0.1(51154) idle
postgres 30841 0.1 0.1 121776 38844 ? Ss 08:03 0:04 postgres: rhq rhq 127.0.0.1(51340) idle
postgres 31674 0.2 0.1 125424 39604 ? Ss 08:08 0:04 postgres: rhq rhq 127.0.0.1(51823) idle
postgres 31702 0.0 0.2 122984 49660 ? Ss Sep23 0:07 postgres: rhq rhq 127.0.0.1(40525) idle
postgres 32761 0.0 0.1 123000 47332 ? Ss Sep23 0:07 postgres: rhq rhq 127.0.0.1(53885) idle
and after restart:
root@ct-front:~# ps auxw|grep postgres|grep rhq
postgres 4571 4.6 0.0 105016 15876 ? Ss 08:43 0:01 postgres: rhq rhq 127.0.0.1(55153) idle
postgres 4572 0.0 0.0 102052 5656 ? Ss 08:43 0:00 postgres: rhq rhq 127.0.0.1(55156) idle
postgres 7387 0.0 0.0 106948 16364 ? Ss Sep23 1:01 postgres: postgres rhq 127.0.0.1(60327) idle
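Listings like the two above are easier to compare when reduced to counts. The following is a minimal sketch that groups backend lines by database user, database and state. count_backends is a hypothetical helper name, and the parsing assumes the process title layout seen in this thread ("postgres: user database host(port) state", or "[local]" in place of the host for socket connections).

```shell
#!/bin/sh
# Sketch only: summarize "ps auxw" lines of Postgres backends read from stdin
# as "<count> <db user> <database> <state>", most frequent first.
# count_backends is a hypothetical name used for illustration.
count_backends() {
  awk '
    {
      # find the "postgres:" token that starts the process title
      for (i = 1; i <= NF; i++) if ($i == "postgres:") break
      # need at least: user, database, client, state after it
      if (i + 4 > NF) next
      # skip helper processes (writer, checkpointer, ...): no client field
      if ($(i + 3) !~ /\([0-9]+\)$/ && $(i + 3) != "[local]") next
      # the state may be several words, e.g. "idle in transaction"
      state = $(i + 4)
      for (j = i + 5; j <= NF; j++) state = state " " $j
      count[$(i + 1) " " $(i + 2) " " state]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -rn
}

# Usage:
#   ps auxw | count_backends
```

Run periodically, this would show whether the idle "rhq" backends keep accumulating between RHQ server restarts.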
I have uploaded a screenshot here: https://www.dropbox.com/s/ysmh9uq37xxacs0/RHQ-restart.png
Regards,
Attila
Hi,
What is the value of "max_files_per_process" in your Postgres databases?
Thanks, Thomas
> What is the value of "max_files_per_process" in your Postgres databases?
> Thanks, Thomas
It is at the default value, maybe 1000?
root@ct-front:/etc/postgresql/9.2/main# grep max_files_per_process *
postgresql.conf:#max_files_per_process = 1000 # min 25
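The grep output shows the line commented out, which means the file does not set the parameter at all. A minimal sketch of that check is given below; conf_setting is a hypothetical helper name, and it is a simplification that only understands plain "name = value" lines (it ignores include files and quoted values containing "#" or "=").

```shell
#!/bin/sh
# Sketch only: report the value a postgresql.conf (read from stdin) assigns to
# the parameter named in $1, or "default" if every occurrence is commented out.
# conf_setting is a hypothetical name used for illustration.
conf_setting() {
  awk -v name="$1" '
    {
      line = $0
      sub(/#.*/, "", line)                 # drop comments and commented-out lines
      n = split(line, kv, "=")
      if (n < 2) next
      key = kv[1]; gsub(/[ \t]/, "", key)
      if (key != name) next
      val = kv[2]; gsub(/^[ \t]+|[ \t]+$/, "", val)
      found = val                          # the last assignment wins
    }
    END { print (found == "" ? "default" : found) }
  '
}

# Usage:
#   conf_setting max_files_per_process < postgresql.conf
```

The value the server is really using can be confirmed from a database session with "SHOW max_files_per_process;".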
On 24/09/2013 at 11:42, Attila Heidrich wrote:
> It is on the default value, maybe 1000?
Yes, if unset, the default is 1000. Which OS is your database running on?