On Fri, Apr 10, 2015 at 03:26:02PM +0200, Thomas HUMMEL wrote:
We tried sssd 1.12.4 and it doesn't fix the problem.
Continuing the debugging, we wanted to determine whether the problem comes from slurm, glibc or sssd. Here's what we've tried:
1. We hacked the slurmd code to add a getgroups() call before and after slurm calls initgroups():
    debug2("Uncached user/gid: %s/%ld", job->user_name, (long)job->gid);
    /* added: group count before initgroups() */
    debug2("Before initgroups number of groups for %s/%ld : %d",
           job->user_name, (long)job->gid, getgroups(0, NULL));
    if ((rc = initgroups(job->user_name, job->gid))) {
        if ((errno == EPERM) && (getuid() != (uid_t) 0)) {
            debug("Error in initgroups(%s, %ld): %m",
                  job->user_name, (long)job->gid);
        } else {
            error("Error in initgroups(%s, %ld): %m",
                  job->user_name, (long)job->gid);
        }
        return -1;
    }
    /* added: group count after initgroups() */
    debug2("After initgroups number of groups for %s/%ld : %d",
           job->user_name, (long)job->gid, getgroups(0, NULL));
    return 0;
-> When the problem occurs (note that slurmd runs as root before dropping privileges):

    Apr 10 17:10:28 myriad-n407 slurmstepd[7219]: Before initgroups number of groups for njoly/3044 : 0
    Apr 10 17:10:28 myriad-n407 slurmstepd[7219]: After initgroups number of groups for njoly/3044 : 1

-> When the problem does not occur:

    Apr 10 17:32:14 myriad-n407 slurmstepd[11075]: Before initgroups number of groups for njoly/3044 : 0
    Apr 10 17:32:14 myriad-n407 slurmstepd[11075]: After initgroups number of groups for njoly/3044 : 11
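So our understanding is that slurm is not to blame: initgroups() is called with the same user and gid in both runs, yet it leaves the process with 1 group in one case and 11 in the other.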
Note: in earlier tests where we placed a getgroups() call elsewhere in the code, we sometimes saw more than one group returned, but not all of them. So sometimes only a subset of the supplementary groups is retrieved.
2. We stopped sssd, removed its cache files (mc/* and db/*), and put the user in /etc/passwd and all of his groups, supplementary as well as primary, in /etc/group:
-> The problem no longer occurs.
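So we think that glibc is not to blame either: the same initgroups() call returns the full group list once the groups come from the local files instead of sssd.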
Conclusion: it seems to us that this really is an sssd problem. Could you point us to a place in the sssd source code where we could start investigating further? We have been unable to build a test case without slurm.
Thanks