On Fri, Oct 18, 2019, at 11:41 AM, Jamal Mahmoud wrote:
Hi Jakub,
Apologies for the long delay in response as I was dragged away for other projects! So my previous (false)theory was a result of sssd_nss not being able to see entries that the sssd_be places into the memcache. It would not be able to find it's group so it would query any other domains it could see. I've now been able to isolate the issue down to an even smaller plausible cause.
After some digging through logs of an affected machine I've discovered a very interesting set of logs. This is quite verbose so bear with me. The command I ran was: $ getent group $GID
Here we see NSS requesting a GID number to be resolved/fetched: (Thu Oct 17 17:02:07 2019) [sssd[nss]] [nss_getby_id] (0x0400): Input ID: 1000001111 (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_send] (0x0400): CR #187: New request 'Group by ID' (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_select_domains] (0x0400): CR #187: Performing a multi-domain search (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_domains] (0x0400): CR #187: Search will check the cache and check the data provider
First it checks the NSS cache to see if an entry is present in it's cache: (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_set_domain] (0x0400): CR #187: Using domain [mydomain.com] (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_send] (0x0400): CR #187: Looking up GID:1000001111@mydomain.com (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187: Checking negative cache for [GID:1000001111@mydomain.com] (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187: [GID:1000001111@mydomain.com] is not present in negative cache (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache
It's not present in the cache according to NSS so it requests the backend to search the domain provider for this entry: (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_dp] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in data provider (Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x55ee5f5d1b10:2:1000001111@mydomain.com] (Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [mydomain.com][0x2][BE_REQ_GROUP][idnumber=1000001111:-] (Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x55ee5f5d1b10:2:1000001111@mydomain.com]
The domain provider finds the entry on the domain and fills the cache: (Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sdap_save_group] (0x0400): Storing info for group TheGroup@mydomain.com (Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sysdb_store_group] (0x1000): The group record of TheGroup@mydomain.com did not change, only updated the timestamp cache
What's interesting is that the sssd_be tells us that the entry was already present, yet nss was unaware of any entries in the cache, nss didn't even say (not exact quote) "entry found, needs updating", which after some testing on a working machine, is what occurs when nss encounters an entry that is out of date.
Here we see sssd_nss responding to the backend's return of the group entry. (Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 0 errno: 0 error message: Success (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking up [GID:1000001111@mydomain.com] in cache (Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object [GID:1000001111@mydomain.com] was not found in cache
Either the backend is entering data wrongly into the cache or nss is unable to read certain entries that are placed in the cache, I would like to think it is the former because the issue is so intermittent and for the most part, it works correctly. The only workaround we have found so far is stopping sssd.service, removing the ldbs from /var/lib/sss/db/ and restarting the service. This allows sssd to work correctly and return the groups correctly.
Thanks for digging thru this. I've experienced similar behavior, but never chased it down. No further workarounds I've discovered.
V/r, James Cassell
P.S., would be helpful if you include context in your replies.