Hi,
We run sssd to bind a number of machines to LDAP for auth. On a subset of these machines, we have software that makes several thousand IPv6 route changes per second.
Recently, we found that on these hosts the sssd_nss responder process fails several times a day[1], and will not recover until sssd is restarted. strace[2] of the main sssd process indicates that sssd is receiving many, many netlink messages - so many, in fact, that sssd cannot process them fast enough and is receiving ENOBUFS from recvmsg(2).
The messages that are received seem to get forwarded[3] to the sssd responders over the unix socket and flood them until they fail.
From what I can see, the netlink code in src/monitor/monitor_netlink.c:setup_netlink() subscribes to netlink notifications with the aim of detecting things like wifi network changes. This isn't something we'd find useful on our servers and seems to have performance implications - is there any easy way of turning off this functionality in sssd that I've missed?
We see this issue running sssd 1.11.7.
Cheers,
Patrick
1. The failures look something like this. I have replaced our sss domain with "ourdomain" /var/log/sssd/sssd_nss.log
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [accept_fd_handler] (0x0100): Client connected! (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups] (0x0100): Requesting info for [systemuser] from [<ALL>] (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [systemuser@ourdomain] (Tue Mar 22 02:59:04 2016) [sssd[nss]] [nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information from Data Provider Error: 3, 5, (null) Will try to return what we have in cache (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (no further messages until service restarts)
2. strace sssd -i sendmsg(12, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0M\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"..., 120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120 gettimeofday({1458663090, 595705}, NULL) = 0 epoll_wait(5, {{EPOLLIN, {u32=150507384, u64=150507384}}}, 1, 1563) = 1 recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0M\10\0\0\10\0\0\0\5\1u\0M\10\0\0\0\2\1\1\0\0\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 24 gettimeofday({1458663090, 595978}, NULL) = 0 recvmsg(12, 0xffcdcac0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) gettimeofday({1458663090, 596110}, NULL) = 0 gettimeofday({1458663090, 596151}, NULL) = 0 epoll_wait(5, {{EPOLLIN|EPOLLERR, {u32=150512424, u64=150512424}}}, 1, 1563) = 1 recvmsg(11, 0xffcdcb3c, 0) = -1 ENOBUFS (No buffer space available) gettimeofday({1458663090, 596330}, NULL) = 0 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 write(2, "(Tue Mar 22 09:11:30 2016) [sssd"..., 65(Tue Mar 22 09:11:30 2016) [sssd] [netlink_fd_handler] (0x0020): ) = 65 write(2, "Error while reading from netlink"..., 36Error while reading from netlink fd ) = 36 gettimeofday({1458663090, 596538}, NULL) = 0 epoll_wait(5, {{EPOLLIN, {u32=150512424, u64=150512424}}}, 1, 1563) = 1 recvmsg(11, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{"\224\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\200\0\0\1\0\0\1\0\2\0\0\10\0\17\0"..., 4096}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) = 148 gettimeofday({1458663090, 596679}, NULL) = 0 sendmsg(12, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0N\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"..., 120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120 gettimeofday({1458663090, 596777}, NULL) = 0
3. /var/log/sssd/sssd.log (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24 (Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000): route idx 209591 flags 0X200 family 10 addr fd0a:9b09:1f7:0:218:aff:fe34:3d08/128 (Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status (Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not signal service [ourdomain]. (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24 (Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000): route idx 209591 flags 0X200 family 10 addr fd0a:9b09:1f7:0:218:aff:fe33:3b66/128 (Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status (Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not signal service [ourdomain]. (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24
$ grep network_status_change_cb sssd.log | grep '06:59:12' | wc -l 1245
sssd-users@lists.fedorahosted.org