On 04/16/2015 12:31 PM, Jean-Baptiste Denis wrote:
No, it shouldn't be. The whole backend request should run, and only then should the backend signal the frontend to re-check the cache. That's why I was suspecting the cleanup task: it's asynchronous.
I think I've got a test case that doesn't involve slurm. It is quite reproducible on my machine. Since it looks like a race, you may need to tweak the parameters of the python script.
The basic idea is to run a bunch of processes, each of which waits for a short, random amount of time before calling the initgroups libc function for a specific user.
You have to log in as root and not use sudo, to prevent the sssd cache from being populated before the test is started. You also *need* to clean up the sssd state before running the test.
usage:
## log in as root
## check the number of secondary groups for a user, using id for example
# id jbdenis
uid=21489(jbdenis) gid=110(sis) groups=110(sis),3044(CIB),19(floppy),1177(dump-projets),56(netadm),3125(vpn-ssl-admin)
Here I've got 5 secondary groups (sis is my primary group)
## !! VERY IMPORTANT !! cleanup sssd state
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* && /etc/init.d/sssd start
## run this program
# python initgroups.py jbdenis 110 5 24 200
wrong number of secondary groups in process 17145 : 0 instead of 5 (sleep 55ms)
wrong number of secondary groups in process 17149 : 0 instead of 5 (sleep 55ms)
2/24 failed
# first parameter is a login
# second parameter is your primary gid (could be anything)
# third parameter is your number of secondary groups
# fourth parameter is the number of processes you want to run concurrently
# the last parameter is the maximum delay in milliseconds before calling initgroups (the delay is randomized up to this maximum)
I've got good results with 24 processes and a randomized delay of up to 200ms before the initgroups call. I guess the right parameters depend on the machine you're running the script on. You may have to run this test multiple times before triggering the bug.
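For anyone reading this without the script at hand, here is a minimal sketch of the idea described above. It is not the actual initgroups.py, only an approximation under some assumptions: the function names (count_secondary, child, run) are made up for illustration, and the secondary groups are counted as whatever os.getgroups() returns after os.initgroups(), minus the primary gid. It must be run as root, since initgroups needs the privilege to set groups.

```python
import os
import random
import sys
import time


def count_secondary(groups, primary_gid):
    """Count distinct supplementary groups, excluding the primary gid."""
    return len(set(groups) - {primary_gid})


def child(user, gid, expected, max_delay_ms):
    """Sleep a random delay, call initgroups, return 0 if the count matches."""
    delay_ms = random.randint(0, max_delay_ms)
    time.sleep(delay_ms / 1000.0)
    os.initgroups(user, gid)  # wraps libc initgroups(3); requires root
    got = count_secondary(os.getgroups(), gid)
    if got != expected:
        print("wrong number of secondary groups in process %d : "
              "%d instead of %d (sleep %dms)"
              % (os.getpid(), got, expected, delay_ms))
        return 1
    return 0


def run(user, gid, expected, nprocs, max_delay_ms):
    """Fork nprocs children and return how many saw a wrong group count."""
    pids = []
    for _ in range(nprocs):
        pid = os.fork()
        if pid == 0:
            code = 2  # exit code used if the child raises (assumption)
            try:
                random.seed()  # make sure each child draws its own delay
                code = child(user, gid, expected, max_delay_ms)
            except Exception as exc:
                print("process %d failed: %s" % (os.getpid(), exc))
            finally:
                sys.stdout.flush()  # os._exit() does not flush buffers
                os._exit(code)
        pids.append(pid)

    failed = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            failed += 1
    print("%d/%d failed" % (failed, nprocs))
    return failed


if __name__ == "__main__":
    if len(sys.argv) == 6:
        login = sys.argv[1]
        gid, expected, nprocs, max_delay_ms = (int(a) for a in sys.argv[2:6])
        sys.exit(1 if run(login, gid, expected, nprocs, max_delay_ms) else 0)
    else:
        print("usage: initgroups.py LOGIN GID NGROUPS NPROCS MAX_DELAY_MS")
```

Each child is a separate process, so every initgroups call goes through the sssd client path on its own, and the randomized sleep spreads the calls out in time instead of having them all arrive at once.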
I'm unable to reproduce the bug when I use 0 delay, and I think that's why we could reproduce it with our initial test case.
I really hope that you can reproduce the bug on your side.
Thank you for your help,
Jean-Baptiste