New subject: "Child not responding" on loaded servers

3 May 2016


      On 1 May 2016 at 17:04, Jakub Hrozek jhrozek@redhat.com wrote:
...
...
On 30 Apr 2016, at 10:28, Patrick Coleman patrick.coleman@meraki.com wrote:
On 29 Apr 2016 9:10 pm, "Lukas Slebodnik" lslebodn@redhat.com wrote:
...
Do you meand IO related load or CPU related load?
Lots of both, but we're typically IO bound more of the time.
...
If there is issue with CPU then you can mount sssd cache to tmpfs
to avoid such issues. (there are plans to improve it in 1.14)
Cool, I'll give that a go.
Alternatively, increase the 'timeout' option in sssd's sections..
I appreciate the advice, thankyou. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance. This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.
In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
stanzas in the config:
memcache_timeout = 1800
entry_cache_timeout = 1800
...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?
Cheers,
Patrick

Re: "Child not responding" on loaded servers