I suspected a race condition because, like the rest of SSSD, the cleanup task is asynchronous. I suspected the following might have happened:
- initgroups starts:
  - users are written to the cache
  - groups are written to the cache but not linked yet to the user objects
- the cleanup task starts
- the cleanup task removes the group objects because they are "empty".
It shouldn't happen, because the cleanup task should only remove expired entries, but IIRC Lukas saw a similar race condition elsewhere.
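To make the suspected scenario concrete, here is a minimal sketch in C. It is not SSSD code: the struct and both function names are hypothetical, and the real sysdb schema and cleanup logic are more involved. It only contrasts a cleanup rule that removes any memberless group with one that also requires the entry to be expired.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified cache entry; NOT the real sysdb schema. */
struct cached_group {
    const char *name;
    size_t member_count;  /* users already linked to this group */
    time_t expires_at;    /* entry counts as fresh before this time */
};

/* Racy rule: any group without members is treated as garbage, so a group
 * that initgroups has just written but not linked yet would be removed. */
bool cleanup_removes_if_empty(const struct cached_group *g, time_t now)
{
    (void)now;  /* expiry is ignored on purpose to show the problem */
    return g->member_count == 0;
}

/* Safer rule: only an entry that is both expired and empty is removed,
 * so a freshly written group survives until it has been linked. */
bool cleanup_removes_if_expired_and_empty(const struct cached_group *g,
                                          time_t now)
{
    return g->expires_at <= now && g->member_count == 0;
}
```

With a fresh group that has no members yet, the first rule reports it as removable and the second does not, which is the window the race would need.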
"groups are written to the cache but not linked yet to the user objects"
Is it possible for the responder to answer a client's request for group information before the groups are written to the cache AND linked to the user? That's what the getgroups syscall (from the client) returning the wrong number of groups would suggest when the problem occurs. Could that be related to ghost or fake entries?
Maybe you could show us exactly where to look for:
- where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so the initgroups request that you run starts at src/providers/ldap/sdap_async_initgroups.c:385 in the function sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv().
- where the backend is signaling to the responder that the cache has been updated
The schema-specific request is the one I listed above; it then returns to the generic LDAP code in ldap_common.c. The function that signals over sbus (the D-Bus protocol used over a UNIX socket) is sdap_handler_done(), in particular be_req_terminate().
- where the responder is aware that it can now check the cache to get the answer
This is done in src/responder/common/responder_dp.c. The request is sent with sss_dp_get_account_send().
This code is a bit complex, because concurrent requests are just added to a queue in sss_dp_issue_request() if the corresponding request is already found in the rctx->dp_request_table hash table. But the first request that finishes would receive an sbus message from the provider in sss_dp_internal_get_done(). Then it would iterate over the queue of requests and mark them as done or failed.
The callback that should be invoked by this generic NSS code is nss_cmd_getby_dp_callback().
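The queueing of concurrent requests described above can be illustrated with a small, self-contained sketch. It is not the SSSD implementation: every name here (request_table, issue_request, request_done, record_status) is made up for illustration, and a fixed-size array stands in for the rctx->dp_request_table hash table. It only shows the idea that the first caller for a key triggers the backend request, later callers are queued behind it, and one reply completes them all.

```c
#include <stddef.h>
#include <string.h>

#define KEY_LEN     64
#define MAX_PENDING 8
#define MAX_WAITERS 8

/* Callback type for a queued caller (hypothetical). */
typedef void (*waiter_fn)(int status, void *pvt);

struct waiter { waiter_fn fn; void *pvt; };

/* One in-flight backend request and everyone waiting on it. */
struct pending_request {
    char key[KEY_LEN];
    int in_use;
    size_t num_waiters;
    struct waiter waiters[MAX_WAITERS];
};

/* Stand-in for a hash table keyed by request. Zero-initialize before use. */
struct request_table { struct pending_request slots[MAX_PENDING]; };

static struct pending_request *find_pending(struct request_table *t,
                                            const char *key)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slots[i].in_use && strcmp(t->slots[i].key, key) == 0) {
            return &t->slots[i];
        }
    }
    return NULL;
}

/* Returns 1 if the caller must send a new backend request, 0 if it was
 * queued behind an identical in-flight request, -1 if there is no room. */
int issue_request(struct request_table *t, const char *key,
                  waiter_fn fn, void *pvt)
{
    struct pending_request *req = find_pending(t, key);
    int is_new = 0;

    if (req == NULL) {
        if (strlen(key) >= KEY_LEN) return -1;
        for (size_t i = 0; i < MAX_PENDING && req == NULL; i++) {
            if (!t->slots[i].in_use) req = &t->slots[i];
        }
        if (req == NULL) return -1;
        strcpy(req->key, key);
        req->in_use = 1;
        req->num_waiters = 0;
        is_new = 1;
    }
    if (req->num_waiters == MAX_WAITERS) return -1;

    req->waiters[req->num_waiters].fn = fn;
    req->waiters[req->num_waiters].pvt = pvt;
    req->num_waiters++;
    return is_new;
}

/* The backend replied: notify every queued caller, then free the slot.
 * Callbacks in this sketch must not call issue_request() themselves.
 * Returns the number of callers that were notified. */
size_t request_done(struct request_table *t, const char *key, int status)
{
    struct pending_request *req = find_pending(t, key);
    if (req == NULL) return 0;

    size_t n = req->num_waiters;
    for (size_t i = 0; i < n; i++) {
        req->waiters[i].fn(status, req->waiters[i].pvt);
    }
    req->in_use = 0;
    return n;
}

/* Example waiter: stores the status in the int that pvt points to. */
void record_status(int status, void *pvt)
{
    *(int *)pvt = status;
}
```

If two callers issue the same key before the reply arrives, only the first gets 1 back, and a single request_done() call invokes both callbacks with the same status. The point for this thread is that all queued callers are released by the same provider reply, so they all look at the cache in whatever state it is in at that moment.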
- where the responder is actually getting the data from the sysdb cache
src/responder/nss/nsssrv_cmd.c, in particular nss_cmd_initgroups_search() and the function check_cache().
Thank you for this extensive answer. We were quite close to this understanding. We'll try to dig more.
Jean-Baptiste