A shot in the dark, but maybe worth a try: can you try disabling the cleanup task?
ldap_purge_cache_timeout = 0
in the [domain] section. The cleanup might cause some groups with no members to be removed; I wonder if that is your case.
Just did this, but didn't work.
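For reference, here is what the relevant part of our sssd.conf looks like after the change (the domain name and LDAP parameters below are placeholders, not our real values):

  [domain/default]
  id_provider = ldap
  ldap_uri = ldap://ldap.example.com
  ldap_search_base = dc=example,dc=com
  # disable the periodic cache cleanup task, as suggested
  ldap_purge_cache_timeout = 0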
Maybe I don't understand the purpose of this test, but the result does not surprise me because the ldap cache is empty at that time. As Thomas stated in the initial message of this thread, our actual test case starts with:
  /etc/init.d/sssd stop
  rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
  /etc/init.d/sssd start
before running anything else. So I guess the ldap backend has no need to be cleaned up at this particular time. If I run the test case again without restarting sssd and without cleaning up the cache, I get no problem for the next jobs (maybe until the next ldap purge). I think this is exactly how we first encountered the problem: sometimes, some jobs were failing with a permission denied error while accessing a directory owned by one of the user's supplementary groups. The instrumented slurmd code showed us that initgroups was not correctly returning the secondary groups, and the sssd backend log showed some purge activity, if I remember correctly (this needs confirmation).
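To make the check itself explicit: right after the stop/cleanup/start sequence above, we look at the supplementary groups through NSS, roughly like this ("someuser" is a placeholder):

  id -G someuser              # supplementary group IDs, resolved via NSS/sssd
  getent initgroups someuser  # same lookup path as the initgroups() call in slurmd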
In a previous message, you said:
I think this means the frontend (responder) either checks too soon or the back end wrote incomplete data.
We are not 100% sure that we've found the right place to look, but each time we instrumented the code to print the number of groups, we got the correct answer.
Maybe you could show us exactly where to look for:
- where the backend writes the groups data to the sysdb cache
- where the backend signals to the responder that the cache has been updated
- where the responder learns that it can now check the cache to get the answer
- where the responder actually reads the data from the sysdb cache
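In the meantime, to confirm the purge activity mentioned above, we can raise the verbosity on both sides and replay a failing job; a rough sketch (our real domain name is substituted with "default"):

  # in sssd.conf, under both the [nss] and [domain/default] sections:
  #   debug_level = 9
  /etc/init.d/sssd restart
  # replay the failing job, then watch the responder and backend logs
  tail -f /var/log/sssd/sssd_nss.log /var/log/sssd/sssd_default.log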
Thank you for your help,
Jean-Baptiste