Hi,
We run sssd to bind a number of machines to LDAP for auth. On a subset of these machines, we have software that makes several thousand IPv6 route changes per second.
Recently, we found that on these hosts the sssd_nss responder process fails several times a day[1], and will not recover until sssd is restarted. strace[2] of the main sssd process indicates that sssd is receiving many, many netlink messages - so many, in fact, that sssd cannot process them fast enough and is receiving ENOBUFS from recvmsg(2).
The messages that are received seem to get forwarded[3] to the sssd responders over the unix socket and flood them until they fail.
From what I can see, the netlink code in src/monitor/monitor_netlink.c:setup_netlink() subscribes to netlink notifications with the aim of detecting things like wifi network changes. This isn't something we'd find useful on our servers and seems to have performance implications - is there any easy way of turning off this functionality in sssd that I've missed?
We see this issue running sssd 1.11.7.
Cheers,
Patrick
1. The failures look something like this. I have replaced our sss domain with "ourdomain" /var/log/sssd/sssd_nss.log
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [accept_fd_handler] (0x0100): Client connected! (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups] (0x0100): Requesting info for [systemuser] from [<ALL>] (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [systemuser@ourdomain] (Tue Mar 22 02:59:04 2016) [sssd[nss]] [nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information from Data Provider Error: 3, 5, (null) Will try to return what we have in cache (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080): CONFDB: /var/lib/sss/db/config.ldb (Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id] (0x0100): Sending ID: (nss,1) (Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such file or directory]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain]. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services. (Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector (no further messages until service restarts)
2. strace sssd -i sendmsg(12, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0M\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"..., 120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120 gettimeofday({1458663090, 595705}, NULL) = 0 epoll_wait(5, {{EPOLLIN, {u32=150507384, u64=150507384}}}, 1, 1563) = 1 recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\0\0\0\0M\10\0\0\10\0\0\0\5\1u\0M\10\0\0\0\2\1\1\0\0\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 24 gettimeofday({1458663090, 595978}, NULL) = 0 recvmsg(12, 0xffcdcac0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) gettimeofday({1458663090, 596110}, NULL) = 0 gettimeofday({1458663090, 596151}, NULL) = 0 epoll_wait(5, {{EPOLLIN|EPOLLERR, {u32=150512424, u64=150512424}}}, 1, 1563) = 1 recvmsg(11, 0xffcdcb3c, 0) = -1 ENOBUFS (No buffer space available) gettimeofday({1458663090, 596330}, NULL) = 0 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 write(2, "(Tue Mar 22 09:11:30 2016) [sssd"..., 65(Tue Mar 22 09:11:30 2016) [sssd] [netlink_fd_handler] (0x0020): ) = 65 write(2, "Error while reading from netlink"..., 36Error while reading from netlink fd ) = 36 gettimeofday({1458663090, 596538}, NULL) = 0 epoll_wait(5, {{EPOLLIN, {u32=150512424, u64=150512424}}}, 1, 1563) = 1 recvmsg(11, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000400}, msg_iov(1)=[{"\224\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\200\0\0\1\0\0\1\0\2\0\0\10\0\17\0"..., 4096}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) = 148 gettimeofday({1458663090, 596679}, NULL) = 0 sendmsg(12, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0N\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"..., 120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120 gettimeofday({1458663090, 596777}, NULL) = 0
3. /var/log/sssd/sssd.log (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24 (Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000): route idx 209591 flags 0X200 family 10 addr fd0a:9b09:1f7:0:218:aff:fe34:3d08/128 (Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status (Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not signal service [ourdomain]. (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24 (Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000): route idx 209591 flags 0X200 family 10 addr fd0a:9b09:1f7:0:218:aff:fe33:3b66/128 (Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status (Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not signal service [ourdomain]. (Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink Message type: 24
$ grep network_status_change_cb sssd.log | grep '06:59:12' | wc -l 1245
On (23/03/16 15:49), Patrick Coleman wrote:
Hi,
We run sssd to bind a number of machines to LDAP for auth. On a subset of these machines, we have software that makes several thousand IPv6 route changes per second.
Recently, we found that on these hosts the sssd_nss responder process fails several times a day[1], and will not recover until sssd is restarted. strace[2] of the main sssd process indicates that sssd is receiving many, many netlink messages - so many, in fact, that sssd cannot process them fast enough and is receiving ENOBUFS from recvmsg(2).
The messages that are received seem to get forwarded[3] to the sssd responders over the unix socket and flood them until they fail.
From what I can see, the netlink code in src/monitor/monitor_netlink.c:setup_netlink() subscribes to netlink notifications with the aim of detecting things like wifi network changes. This isn't something we'd find useful on our servers and seems to have performance implications - is there any easy way of turning off this functionality in sssd that I've missed?
We see this issue running sssd 1.11.7.
Cheers,
Patrick
- The failures look something like this. I have replaced our sss
domain with "ourdomain" /var/log/sssd/sssd_nss.log
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [accept_fd_handler] (0x0100): Client connected! (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups] (0x0100): Requesting info for [systemuser] from [<ALL>] (Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [systemuser@ourdomain] (Tue Mar 22 02:59:04 2016) [sssd[nss]] [nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information from Data Provider Error: 3, 5, (null)
The real error is in sssd_$domain.log
neither sssd.log nor sssd_nss.log will help you.
@see https://fedorahosted.org/sssd/wiki/Troubleshooting
LS
Hi Lukas,
On 23 March 2016 at 15:58, Lukas Slebodnik lslebodn@redhat.com wrote:
On (23/03/16 15:49), Patrick Coleman wrote:
$ grep network_status_change_cb sssd.log | grep '06:59:12' | wc -l 1245
I believe the root cause of our problem is that sssd is receiving more than a thousand netlink notifications a second - because of IPv6 route updates - per above. While this might be something we can reduce on our end, it would be great if there was some way to unsubscribe sssd from receiving these.
Any ideas here appreciated - eg. if netlink is only used to get a more rapid notification of network changes, could we remove the call to setup_netlink in src/monitor/monitor.c in a local build? Does sssd rely on netlink for anything else?
(Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000): route idx 209591 flags 0X200 family 10 addr fd0a:9b09:1f7:0:218:aff:fe33:3b66/128 (Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status
If I interpret this correctly, every single netlink notification seems to be causing sssd to send a message over dbus?
The real error is in sssd_$domain.log
neither sssd.log nor sssd_nss.log will help you.
I see nothing wrong in the domain logs. I've included samples of normal operation before the failure this morning in the domain log below in any case. If there's anything further I can provide let me know.
Cheers,
Patrick
(Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_groups_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(gidNumber=1111)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com]. (Wed Mar 23 06:54:04 2016) [sssd[be[meraki]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_groups_process] (0x0400): Search for groups, returned 1 results. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_attrs_get_sid_str] (0x0080): No [objectSID] attribute while id-mapping. [0][Success] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_primary_name] (0x0400): Processing object blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Processing group blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Original USN value is not available for [blinken]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_process_ghost_members] (0x0400): Group has 0 members (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Storing info for group blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sysdb_search_group_by_name] (0x0400): No such entry (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sysdb_search_group_by_gid] (0x0400): No such entry (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [acctinfo_callback] (0x0100): Request processed. Returned 0,0,Success (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [be_get_account_info] (0x0100): Got request for [4099][1][name=blinken] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [be_req_set_domain] (0x0400): Changing request domain from [ourdomain] to [ourdomain] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_initgr_next_base] (0x0400): Searching for users with base [dc=meraki,dc=com] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(uid=blinken)(objectclass=posixAccount)(&(uidNumber=*)(!(uidNumber=0))))][dc=meraki,dc=com]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): Save user (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_attrs_get_sid_str] (0x0080): No [objectSID] attribute while id-mapping. [0][Success] (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_get_primary_name] (0x0400): Processing object blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): Processing user blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): Original memberOf is not available for [blinken]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): Original USN value is not available for [blinken]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): User principal is not available for [blinken]. (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sdap_save_user] (0x0400): Storing info for user blinken (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sysdb_search_user_by_name] (0x0400): No such entry (Wed Mar 23 06:54:04 2016) [sssd[be[ourdomain]]] [sysdb_search_user_by_uid] (0x0400): No such entry (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com] (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com]. (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_groups_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com] (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(gidNumber=1111)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com]. (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_groups_process] (0x0400): Search for groups, returned 1 results. (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_attrs_get_sid_str] (0x0080): No [objectSID] attribute while id-mapping. [0][Success] (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_get_primary_name] (0x0400): Processing object blinken (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Processing group blinken (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Original USN value is not available for [blinken]. (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_process_ghost_members] (0x0400): Group has 0 members (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sdap_save_group] (0x0400): Storing info for group blinken (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sysdb_search_group_by_name] (0x0400): No such entry (Wed Mar 23 06:54:05 2016) [sssd[be[ourdomain]]] [sysdb_search_group_by_gid] (0x0400): No such entry
On (23/03/16 17:47), Patrick Coleman wrote:
Hi Lukas,
On 23 March 2016 at 15:58, Lukas Slebodnik lslebodn@redhat.com wrote:
On (23/03/16 15:49), Patrick Coleman wrote:
$ grep network_status_change_cb sssd.log | grep '06:59:12' | wc -l 1245
I believe the root cause of our problem is that sssd is receiving more than a thousand netlink notifications a second - because of IPv6 route updates - per above. While this might be something we can reduce on our end, it would be great if there was some way to unsubscribe sssd from receiving these.
Any ideas here appreciated - eg. if netlink is only used to get a more rapid notification of network changes, could we remove the call to setup_netlink in src/monitor/monitor.c in a local build? Does sssd rely on netlink for anything else?
ahh, netlink might explain your observation.
It isn't possible to change it in runtime IIRC. You might recompile sssd with --with-libnl=no
LS
On 23 March 2016 at 19:18, Lukas Slebodnik lslebodn@redhat.com wrote:
ahh, netlink might explain your observation.
It isn't possible to change it in runtime IIRC. You might recompile sssd with --with-libnl=no
Thanks Lukas, rebuilding the package without netlink seems to have fixed the issue. Will test further next week.
Appreciate the quick response.
Cheers,
Patrick
sssd-users@lists.fedorahosted.org