Summary of Event
================
Tonight there was an unplanned outage of two proxy servers (proxy01 and proxy02). The proxies were unresponsive and needed to be rebooted in order to come back online. Proxy01 being down caused a cascade of other issues that should have had very little end-user impact. As far as we know, the applications on admin.fp.o would have been up but appeared very slow and the wiki would have been up for reading but logging in would have failed. Explanation to follow.
Proxy01 is the only proxy server used by the app servers (web apps, cronjobs, etc.) in phx2 when they need to talk to our web applications in phx2. This was set up because the router that handles traffic into and out of phx2 does not allow us to "hairpin", that is, to send a request from phx2 to an external ip address that then resolves back to a server in phx2. As currently implemented, we have an /etc/hosts entry that points admin.fedoraproject.org at the internal ip address of proxy01 in phx2.
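To make the /etc/hosts workaround concrete, here is a minimal Python sketch of how a hosts-file override pins a name to a single address. The address 10.0.0.10 is a hypothetical stand-in for the proxy's internal ip, and `parse_hosts` is an illustrative helper, not something from our configuration.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the first address listed for it."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # The first entry for a name wins, as with a real hosts file.
            table.setdefault(name.lower(), address)
    return table


# Hypothetical hosts file for an app server inside phx2.
HOSTS_TEXT = """\
127.0.0.1   localhost
# Hypothetical internal address standing in for the phx2 proxy
10.0.0.10   admin.fedoraproject.org
"""

table = parse_hosts(HOSTS_TEXT)
print(table["admin.fedoraproject.org"])  # 10.0.0.10
```

Because the name maps to exactly one address, every phx2 client depends on that one proxy being up.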
When proxy01 went down, things in PHX2 that needed to talk to admin.fedoraproject.org were no longer able to get the data they needed. For the wiki, this meant that anyone attempting to log in during the outage could not have their password verified against fas. For the TurboGears apps on admin.fedoraproject.org the situation was worse. TG1 apps' identity management depends on visit tracking to work. Visit tracking hits fas for every request. This means that no page could be served for the TG1 apps from the phx2 app servers.
We have two app servers that reside outside of phx2. Because of network latency between these servers and the database server in phx2, these servers are configured to be backups for the servers in phx2, not handling requests unless phx2 is unable to. The remaining proxy servers detected that the app servers within phx2 were down and properly switched over to the app servers outside of phx2, so there was no apparent outage for people trying to use admin.fedoraproject.org, although responses would have been drastically slower.
Looking at the haproxy status page for proxy03 during the outage we noticed that only one of the two app servers outside of phx2 (app05 at ibiblio) was handling traffic. app06 (at telia) was not. We are not sure why this is. One possibility is that telia's network latency is just too high so haproxy decided that app06 was also down and did not pass traffic to it.
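The backup behaviour described above can be sketched as a small Python model. This is not haproxy's implementation, only an illustration under stated assumptions: `app01` is a hypothetical name standing in for the phx2 app servers, and the latencies and timeout are made up. It models two things that could each leave app06 idle: a health check that times out, and the fact that haproxy by default sends traffic only to the first available backup unless `option allbackups` is set. Whether either applies here depends on our actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    backup: bool           # True for servers marked as backups
    check_latency: float   # seconds the health check takes to answer (made up)
    reachable: bool = True


def is_up(server, check_timeout):
    """A server counts as up only if its health check answers in time."""
    return server.reachable and server.check_latency <= check_timeout


def eligible_servers(servers, check_timeout, all_backups=False):
    """Return the names of the servers that would receive traffic.

    Primaries are used whenever at least one is up. Backups are used only
    when every primary is down, and only the first one unless all_backups
    is set.
    """
    up = [s for s in servers if is_up(s, check_timeout)]
    primaries = [s.name for s in up if not s.backup]
    if primaries:
        return primaries
    backups = [s.name for s in up if s.backup]
    return backups if all_backups else backups[:1]


# Case 1: app06 answers its health check too slowly and is marked down.
slow_app06 = [
    Server("app01", backup=False, check_latency=0.1, reachable=False),
    Server("app05", backup=True, check_latency=0.2),
    Server("app06", backup=True, check_latency=3.0),
]
print(eligible_servers(slow_app06, check_timeout=2.0, all_backups=True))  # ['app05']

# Case 2: app06 is healthy, but only the first backup is used by default.
healthy_app06 = [
    Server("app01", backup=False, check_latency=0.1, reachable=False),
    Server("app05", backup=True, check_latency=0.2),
    Server("app06", backup=True, check_latency=0.5),
]
print(eligible_servers(healthy_app06, check_timeout=2.0))  # ['app05']
```

Both cases produce the same picture on the status page of traffic distribution, so the status of app06's health check (up or down) during the outage would distinguish them.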
Action Items
============
There are some open questions to try to resolve:
* Why did proxy01 and proxy02 die? A brief look at the logs has not revealed a cause for this.
* Why didn't app06 take up any of the slack when haproxy started passing traffic to the backups?
We have identified one means of mitigating this in the future:
If we ran internal DNS for phx2 then we could have admin.fedoraproject.org resolve to different proxy servers (using internal ip addresses for the proxies inside of PHX2). This should remove the SPOF on proxy01. We have not yet determined whether we'd need to run more proxy servers inside of PHX2 or if hairpinning would not be an issue if we used proxy servers outside of phx2.
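As a rough illustration of why several addresses behind one name would remove the SPOF, here is a Python sketch of a client that tries each proxy address in turn. The addresses are hypothetical, `connect_to_any` is an illustrative helper, and it only helps clients that actually retry across addresses (Python's `socket.create_connection` does this on its own when given a hostname with several records; not every client does).

```python
import socket


def connect_to_any(addresses, port, timeout=3.0, connect=socket.create_connection):
    """Try each address in order and return the first connection that succeeds.

    `connect` is injectable so the logic can be exercised without a network.
    """
    last_error = None
    for address in addresses:
        try:
            return connect((address, port), timeout=timeout)
        except OSError as exc:
            # Remember the failure and move on to the next proxy.
            last_error = exc
    raise ConnectionError(f"no proxy reachable among {list(addresses)}") from last_error


# Hypothetical internal addresses for two proxies inside phx2:
# conn = connect_to_any(["10.0.0.10", "10.0.0.11"], 443)
```

With a single address in the list, this degenerates to the current situation: one failed proxy means no connection at all.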
-Toshio
On Fri, 19 Aug 2011 19:45:45 -0700 Toshio Kuratomi a.badger@gmail.com wrote:
...snip...
> Action Items
>
> There are some open questions to try to resolve:
>
> - Why did proxy01 and proxy02 die? A brief look at the logs has not revealed a cause for this.
I can't find any cause here. Logs just stop, they were locked up hard. ;(
As a side note: libvirt/kvm supports a watchdog device. We could possibly set up a watchdog on all our guests so they at least reboot if they become unresponsive. Of course, that could lead to problems if they get stuck in a reboot/lockup cycle.
> - Why didn't app06 take up any of the slack when haproxy started passing traffic to the backups?
Yeah, all I can think of is that it was too slow to answer and haproxy didn't want to add it.
> We have identified one means of mitigating this in the future:
>
> If we ran internal DNS for phx2 then we could have admin.fedoraproject.org resolve to different proxy servers (using internal ip addresses for the proxies inside of PHX2). This should remove the SPOF on proxy01. We have not yet determined whether we'd need to run more proxy servers inside of PHX2 or if hairpinning would not be an issue if we used proxy servers outside of phx2.
Well, we do run dns there, so we can tweak it. :)
Hairpinning only comes into play if we try and list a phx2 external IP in there. The problem with listing another external proxy is that then it's likely to be slow... the request would need to go all the way out, then back in to fas.
We could run another proxy that's just internal to phx2. That seems like sort of overkill though. ;(
I think I might sit down and draw up our proxy/app/fas/etc setup and perhaps we can look at a picture and see how we can simplify it or make it more robust.
kevin
infrastructure@lists.fedoraproject.org