On (23/03/16 15:49), Patrick Coleman wrote:
Hi,
We run sssd to bind a number of machines to LDAP for auth. On a subset of these machines, we have software that makes several thousand IPv6 route changes per second.
Recently, we found that on these hosts the sssd_nss responder process fails several times a day[1], and will not recover until sssd is restarted. strace[2] of the main sssd process indicates that sssd is receiving many, many netlink messages - so many, in fact, that sssd cannot process them fast enough and is receiving ENOBUFS from recvmsg(2).
The messages that are received seem to get forwarded[3] to the sssd responders over the unix socket and flood them until they fail.
From what I can see, the netlink code in src/monitor/monitor_netlink.c:setup_netlink() subscribes to netlink notifications with the aim of detecting things like wifi network changes. This isn't something we'd find useful on our servers and seems to have performance implications - is there any easy way of turning off this functionality in sssd that I've missed?
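For illustration only, here is a minimal Python sketch of the mechanism described above: a process binds a NETLINK_ROUTE socket to a set of multicast groups, and when the kernel produces messages faster than the process reads them, the receive call fails with ENOBUFS. This is not sssd's code (sssd is written in C and its actual group subscriptions are not shown here). The names `message_types` and `watch`, the choice of groups, and the buffer size are assumptions made for the example.

```python
import errno
import socket
import struct

# rtnetlink multicast group bits, as defined in <linux/rtnetlink.h>
RTMGRP_LINK = 0x1
RTMGRP_IPV4_ROUTE = 0x40
RTMGRP_IPV6_ROUTE = 0x400

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")


def message_types(buf):
    """Return the nlmsg_type of every netlink message packed in buf."""
    types = []
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size:
            break  # malformed header: stop rather than loop forever
        types.append(msg_type)
        offset += (length + 3) & ~3  # messages are 4-byte aligned
    return types


def watch(groups, max_reads=10):
    """Hypothetical helper: read from a NETLINK_ROUTE socket (Linux only).

    Returns (messages_seen, enobufs_overruns). recv() blocks until the
    kernel has something to deliver.
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, groups))  # (pid, groups); pid 0 lets the kernel assign one
    seen = 0
    overruns = 0
    try:
        for _ in range(max_reads):
            try:
                seen += len(message_types(sock.recv(65536)))
            except OSError as exc:
                if exc.errno != errno.ENOBUFS:
                    raise
                overruns += 1  # the kernel dropped messages we could not keep up with
    finally:
        sock.close()
    return seen, overruns
```

Calling `watch(RTMGRP_LINK | RTMGRP_IPV6_ROUTE)` on a host with heavy route churn would be one way to see how quickly a single-threaded reader falls behind. A listener that omits the route groups never receives those messages at all.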
We see this issue running sssd 1.11.7.
Cheers,
Patrick
[1] The failures look something like this. I have replaced our sss domain with "ourdomain".
/var/log/sssd/sssd_nss.log:
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [accept_fd_handler] (0x0100): Client connected!
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups] (0x0100): Requesting info for [systemuser] from [<ALL>]
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [systemuser@ourdomain]
(Tue Mar 22 02:59:04 2016) [sssd[nss]] [nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information from Data Provider Error: 3, 5, (null)
The real error is in sssd_$domain.log;
neither sssd.log nor sssd_nss.log will help you.
@see https://fedorahosted.org/sssd/wiki/Troubleshooting
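As a rough aid for reading the domain log, the following Python sketch pulls out only the failure-level lines. It assumes the domain log uses the same line layout as the sssd_nss.log excerpt above, and that a smaller hex level means a more severe message (the excerpt shows 0x0040 on the failure and 0x0100 on informational lines). The function name `failures` and the default threshold are hypothetical.

```python
import re

# Matches lines shaped like the excerpt above:
# (Tue Mar 22 02:59:04 2016) [sssd[nss]] [some_function] (0x0040): message
LINE = re.compile(
    r"^\((?P<when>[^)]*)\) \[(?P<proc>.+?)\] \[(?P<func>[^\]]+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)


def failures(lines, max_level=0x0080):
    """Return (when, func, level, msg) for lines at or below max_level.

    Assumption: lower level values are more severe, as in the excerpt.
    Lines that do not match the expected layout are skipped.
    """
    found = []
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        level = int(match.group("level"), 16)
        if level <= max_level:
            found.append((match.group("when"), match.group("func"), level, match.group("msg")))
    return found
```

For example, `failures(open("/var/log/sssd/sssd_ourdomain.log"))` would list the failure entries around the time of the timeout, where "ourdomain" stands in for the real domain name.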
LS