We're continuing to have timeout issues in Taskotron staging, and my investigation so far suggests that the cause is slow queries in the database.
This change does require a database restart, but due to the way Taskotron works, I can do it without losing any jobs and without significant downtime. I'd stop incoming jobs until all the queues are empty, recording the jobs that would have been scheduled on a machine outside of infra. Once the queues are empty, I'd shut down all the db-using processes, apply the patch, restart the db, start everything back up, and enqueue the jobs that would have been scheduled in the meantime.
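To make the ordering concrete, here's a minimal sketch of that drain-and-restart sequence. The systemd unit names and the queue check below are placeholders, not the actual Taskotron services or tooling:

#!/usr/bin/env python
# Hedged sketch of the drain/patch/restart sequence described above.
# The unit names and the queue check are placeholders, not the real
# Taskotron services or admin tooling.
import subprocess
import time

JOB_INTAKE = "taskotron-trigger.service"        # placeholder unit name
DB_CLIENTS = ["taskotron-master.service",       # placeholder unit names
              "resultsdb.service"]

def queues_empty():
    """Placeholder: would ask the scheduler whether any jobs remain queued."""
    return True

# 1. stop accepting new jobs; jobs arriving now get recorded elsewhere
subprocess.check_call(["systemctl", "stop", JOB_INTAKE])

# 2. wait for the existing queues to drain
while not queues_empty():
    time.sleep(60)

# 3. stop everything that talks to the database
for unit in DB_CLIENTS:
    subprocess.check_call(["systemctl", "stop", unit])

# 4. apply the ansible patch (outside this script), restart postgres,
#    then bring the services back up
subprocess.check_call(["systemctl", "restart", "postgresql.service"])
for unit in reversed(DB_CLIENTS):
    subprocess.check_call(["systemctl", "start", unit])

# 5. re-enqueue the jobs recorded while intake was stopped (manual step)

The only part that really matters is that nothing touching the database is running when postgres restarts.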
This template change to the postgresql-server module will only affect db-qa01.qa but will make it look like other postgres servers have a pending change due to the way I've changed the postgresql.conf template.
I'm not crazy about doing this during freeze but I'm worried that the timeout problem will start affecting more than stg before long and want to get this figured out before that happens.
+1s?
Tim
diff --git a/roles/postgresql_server/templates/postgresql.conf b/roles/postgresql_server/templates/postgresql.conf
index 603f9ea..c9756b8 100644
--- a/roles/postgresql_server/templates/postgresql.conf
+++ b/roles/postgresql_server/templates/postgresql.conf
@@ -319,9 +319,15 @@ log_rotation_size = 0          # Automatic rotation of logfiles will
                                         #   fatal
                                         #   panic (effectively off)
 
+{% if ansible_hostname.startswith("db-qa01") %}
+log_min_duration_statement = 500        # -1 is disabled, 0 logs all statements
+                                        # and their durations, > 0 logs only
+                                        # statements running at least this time.
+{% else %}
 #log_min_duration_statement = -1        # -1 is disabled, 0 logs all statements
                                         # and their durations, > 0 logs only
                                         # statements running at least this time.
+{% endif %}
 
 #silent_mode = off                      # DO NOT USE without syslog or
                                         # logging_collector
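Once the patched config is in place and postgres has restarted, the effective value can be checked from any client. A minimal sketch, assuming psycopg2 is available and with placeholder connection details:

# Quick check that the new setting is live; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="db-qa01.qa", dbname="postgres", user="postgres")
cur = conn.cursor()
cur.execute("SHOW log_min_duration_statement")
print(cur.fetchone()[0])   # expect '500ms' on db-qa01, '-1' on other hosts
cur.close()
conn.close()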
+1.
After the freeze we may want to look at adding something like this to all the db servers, and also at adding some kind of daily report of the logged slow queries so we can see when they start happening...
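Something along these lines could be the seed of that daily report. A rough sketch only, assuming the stock RHEL log location under /var/lib/pgsql/data/pg_log and the standard "duration: ... ms  statement: ..." log format:

# Rough sketch of a daily slow-query summary, not an existing infra script.
# Assumes postgres writes lines like:
#   LOG:  duration: 1234.567 ms  statement: SELECT ...
# and that logs live in the stock RHEL location (an assumption).
import glob
import re

LOG_GLOB = "/var/lib/pgsql/data/pg_log/postgresql-*.log"
PATTERN = re.compile(r"duration: ([\d.]+) ms\s+(?:statement|execute .*?): (.*)")

slow = []
for path in glob.glob(LOG_GLOB):
    with open(path) as logfile:
        for line in logfile:
            match = PATTERN.search(line)
            if match:
                slow.append((float(match.group(1)), match.group(2).strip()))

slow.sort(reverse=True)
print("Slowest statements logged:")
for duration, statement in slow[:10]:
    print("%10.1f ms  %s" % (duration, statement[:120]))

Cron'ing something like this on each db server and mailing the output would give the kind of trend visibility described above.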
kevin
+1 this looks good.
On Mon, 6 Apr 2015 10:33:15 -0600 Tim Flink tflink@redhat.com wrote:
> I'm not crazy about doing this during freeze but I'm worried that the timeout problem will start affecting more than stg before long and want to get this figured out before that happens.
As an update: I'm not ignoring the +1s. While preparing to make the change on the db server, I managed to stumble on something I can poke at outside of production.
I'm still hoping to avoid changing the db server during freeze, so now that I have something I can dig into, I'm going to keep investigating the issue with my local setup.
For anyone who's interested in following along, the issue is being tracked as:
https://phab.qadevel.cloud.fedoraproject.org/T452
Tim
After much debugging and poking, I think I've figured out why Taskotron staging is having the issues that it is. The root cause is still slow queries, but instead of modifying the production database, there is a way to change how our code uses those slow queries so that mod_wsgi stops timing out and killing the requests.
Since we have a different approach to solving the problem (at least in the short term), I won't be applying the patch to ansible or modifying the database configuration during freeze.
Tim