Hi,
We've got a number of machines using sssd to connect to LDAP for auth. In the past we've had problems with sssd crashing regularly[1], but after posting here we built some custom packages to disable netlink notifications from the kernel, and it's generally improved.
We're still seeing auth failures across random machines - perhaps 1-2% when we run a process which connects to all hosts. The machines are generally heavily loaded when this happens, and sssd.log looks like:
(Fri Apr 29 09:31:19 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Fri Apr 29 09:31:29 2016) [sssd] [tasks_check_handler] (0x0020): Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [tasks_check_handler] (0x0020): Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
While sssd is in this state, it seems to deny auth randomly for LDAP users - they receive "connection closed by remote host". It will eventually restart its children, but that doesn't seem to fix the problem.
Logs for the meraki domain and for nss indicate the subprocesses are running:
/var/log/sssd/sssd_meraki.log
(Fri Apr 29 09:30:53 2016) [sssd[be[meraki]]] [sdap_save_user] (0x0400): Storing info for user blinken
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com]
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com].
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no errmsg set

/var/log/sssd/sssd_nss.log
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No matching domain found for [1155]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [blinken] from [<ALL>]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [blinken@meraki]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [calc_flat_name] (0x0080): Flat name requested but domain has noflat name set, falling back to domain name
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [meraki] from [<ALL>]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0080): No matching domain found for [meraki], fail!
We first saw the behaviour on sssd 1.11.7 and have upgraded to sssd version 1.13.4, with more or less the same symptoms. We've turned enumerate on and off with no apparent change in behaviour.
Does anyone have any suggestions here? Let me know if I can provide more detailed debugging information (perhaps off-list).
Cheers,
Patrick
1. https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.o...
On (29/04/16 17:56), Patrick Coleman wrote:
> Hi,
> We've got a number of machines using sssd to connect to LDAP for auth. In the past we've had problems with sssd crashing regularly[1], but after posting here we built some custom packages to disable netlink notifications from the kernel, and it's generally improved.
> We're still seeing auth failures across random machines - perhaps 1-2% when we run a process which connects to all hosts. The machines are generally heavily loaded when this happens, and sssd.log looks like:
Do you mean I/O-related load or CPU-related load?
If the issue is with CPU, then you can mount the sssd cache on tmpfs to avoid such problems. (There are plans to improve this in 1.14.)
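A minimal sketch of such a mount as an /etc/fstab entry, for illustration only. It assumes the cache lives in /var/lib/sss/db (the usual default; check your distribution), and the size value is a placeholder rather than a recommendation:

```
# /etc/fstab: keep the sssd cache in memory (assumed default cache path)
# size=300M is an illustrative value; size it to fit your cache files.
# On SELinux systems an appropriate rootcontext= option may also be needed.
tmpfs  /var/lib/sss/db  tmpfs  size=300M,mode=0700  0  0
```

Stop sssd before mounting and start it again afterwards. Note that anything held in tmpfs is lost on reboot, so the cache (including cached credentials for offline login) is rebuilt from LDAP after each boot.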
LS