On Wed, Apr 15, 2015 at 10:58:12PM +0200, Jean-Baptiste Denis wrote:
A shot in the dark but maybe worth a try - can you try disabling the cleanup task?
ldap_purge_cache_timeout = 0
in the [domain] section. The cleanup might cause some groups with no members to be removed; I wonder if that is your case.
Just did this, but didn't work.
Maybe I don't understand the purpose of this test, but the result does not surprise me because the ldap cache is empty at that time. As Thomas stated in the initial message of this thread, our actual test case implies:
- /etc/init.d/sssd stop
- rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
- /etc/init.d/sssd start
before running anything else. So I guess the ldap backend has no need to be cleaned up at this particular time.
I was suspecting a race condition because, like the rest of SSSD, the cleanup task is asynchronous. I suspected the following might have happened:
- initgroups starts:
    - users are written to the cache
    - groups are written to the cache but not linked yet to the user objects
- cleanup task starts
- cleanup task removes the group objects because they are "empty"

It shouldn't happen because the cleanup task should only remove expired entries, but IIRC Lukas saw a similar race condition elsewhere.
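To make the suspected interleaving concrete, here is a toy model in Python. It is not SSSD code: the cache layout, the function names and the `only_expired` switch are all made up for illustration. It only shows that a cleanup pass which ignores expiry can drop freshly written groups before they are linked to the user, while one that honours expiry cannot.

```python
def cleanup(cache, now, only_expired=True):
    """Toy cleanup pass: drop groups that have no members.

    With only_expired=True, unexpired groups are kept even if empty,
    which is what the real cleanup task is expected to do.
    """
    for name in list(cache["groups"]):
        group = cache["groups"][name]
        if group["members"]:
            continue
        if only_expired and group["expires"] > now:
            continue
        del cache["groups"][name]


def initgroups(cache, user, groups, now, ttl, interleave=None):
    """Toy initgroups: write user, write groups, then link them.

    `interleave` is a hypothetical hook that runs between the group
    write and the linking step, standing in for the async cleanup task.
    Returns the groups that were actually linked to the user.
    """
    cache["users"][user] = {"expires": now + ttl}
    for g in groups:
        cache["groups"][g] = {"members": set(), "expires": now + ttl}

    if interleave is not None:
        interleave(cache)

    linked = []
    for g in groups:
        if g in cache["groups"]:
            cache["groups"][g]["members"].add(user)
            linked.append(g)
    return linked


def run(only_expired):
    cache = {"users": {}, "groups": {}}
    return initgroups(
        cache, "alice", ["g1", "g2"], now=0, ttl=100,
        interleave=lambda c: cleanup(c, now=1, only_expired=only_expired),
    )
```

Calling `run(False)` models the buggy case where cleanup removes the "empty" groups mid-request, and `run(True)` models cleanup that only touches expired entries.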
If I run the test case again without restarting sssd and without cleaning up the cache, I've got no problem for the next jobs (maybe until the next ldap purge). I think that this is exactly how we first encountered the problem: sometimes, some jobs were failing with a permission denied error while accessing a directory owned by one of the user's supplementary groups. The instrumented slurmd code showed us that initgroups was not correctly getting the secondary groups. And the sssd backend log showed some purge activity, if I remember correctly (this needs confirmation).
In a previous message, you said :
I think this means the frontend (responder) either checks too soon or the back end wrote incomplete data.
We are not 100% sure that we've found the right place to look, but each time we instrumented the code to print the number of groups, we got the correct answer.
Maybe you could show us exactly where to look for:
- where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so the initgroups request that you run starts at src/providers/ldap/sdap_async_initgroups.c:385 in the function sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv().
- where the backend is signaling to the responder that the cache has been updated
The schema-specific request is the one I listed above; it then returns to the generic LDAP code in ldap_common.c. The function that signals over sbus (the D-Bus protocol used over a unix socket) is sdap_handler_done(), in particular be_req_terminate().
- where the responder is aware that it can now check the cache to get the answer
This is done in src/responder/common/responder_dp.c. The request is sent with sss_dp_get_account_send().
This code is a bit complex, because concurrent requests are just added to a queue in sss_dp_issue_request() if the corresponding request is already found in the rctx->dp_request_table hash table. But the first request that finishes would receive an sbus message from the provider in sss_dp_internal_get_done(). Then it would iterate over the queue of requests and mark them as done or failed.
The callback that should be invoked by this generic NSS code is nss_cmd_getby_dp_callback().
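The queueing described above can be sketched roughly as follows. This is a simplified Python model, not the actual responder code: `RequestTable`, `issue()` and `reply()` are hypothetical names that only loosely mirror what sss_dp_issue_request() and sss_dp_internal_get_done() do with the request table.

```python
class RequestTable:
    """Toy model of coalescing identical provider requests.

    The first caller for a key triggers one message to the provider;
    later callers for the same key only queue their callback.
    """

    def __init__(self, send_to_provider):
        self._pending = {}            # key -> list of waiting callbacks
        self._send = send_to_provider

    def issue(self, key, callback):
        """Return True if a new provider request was sent, else False."""
        if key in self._pending:
            # Same request already in flight: just join the queue.
            self._pending[key].append(callback)
            return False
        self._pending[key] = [callback]
        self._send(key)
        return True

    def reply(self, key, ok):
        """Provider answered: notify every queued caller, then forget the key."""
        for cb in self._pending.pop(key, []):
            cb(key, ok)


# Usage: two concurrent initgroups lookups for the same user.
sent = []
results = []
table = RequestTable(sent.append)
first = table.issue("initgroups:alice", lambda k, ok: results.append(("a", ok)))
second = table.issue("initgroups:alice", lambda k, ok: results.append(("b", ok)))
table.reply("initgroups:alice", True)
```

The point of the sketch is that every waiter is released by the same provider reply, so if the cache were incomplete at that moment, all queued callers would see the same incomplete data.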
- where the responder is actually getting the data from the sysdb cache
src/responder/nss/nsssrv_cmd.c, in particular nss_cmd_initgroups_search() and the function check_cache().