On (18/04/15 03:27), Jean-Baptiste Denis wrote:
On 04/16/2015 12:31 PM, Jean-Baptiste Denis wrote:
No, it shouldn't be. The whole backend request should run, and only then should the backend signal the frontend to re-check the cache. That's why I was suspecting the cleanup task: it's asynchronous.
I think I've got a test case that does not involve slurm. It is quite reproducible on my machine. Since it looks like a race, you may need to tweak the parameters of the python script.
The basic idea is to run a bunch of processes, each of which waits a short, random amount of time before calling the initgroups libc function for a specific user.
You have to log in as root rather than use sudo, to prevent the sssd cache from being populated before the test is started. You also *need* to clean up the sssd state before running the test.
usage:
## log in as root
## check the number of secondary groups for a user, using id for example
# id jbdenis
uid=21489(jbdenis) gid=110(sis) groups=110(sis),3044(CIB),19(floppy),1177(dump-projets),56(netadm),3125(vpn-ssl-admin)
Here I've got 5 secondary groups (sis is my primary group)
## !! VERY IMPORTANT !! cleanup sssd state
# /etc/init.d/sssd stop && rm -f /var/lib/sss/mc/* /var/lib/sss/db/* && /etc/init.d/sssd start
## run this program
# python initgroups.py jbdenis 110 5 24 200
wrong number of secondary groups in process 17145 : 0 instead of 5 (sleep 55ms)
wrong number of secondary groups in process 17149 : 0 instead of 5 (sleep 55ms)
2/24 failed
# first parameter is a login
# second parameter is your primary gid (could be anything)
# third parameter is your number of secondary groups
# fourth parameter is the number of processes you want to run concurrently
# the last parameter is the maximum delay in milliseconds before calling initgroups (the delay is randomized up to this maximum)
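For anyone reading without the attachment, here is a minimal sketch of what a script of this kind might look like. It is not the attached initgroups.py itself: the function names (secondary_count, worker, run) are hypothetical, and it assumes Python 3 on Linux, run as root so that os.initgroups is permitted.

```python
import multiprocessing
import os
import random
import time


def secondary_count(groups, primary_gid):
    """Count distinct supplementary groups, excluding the primary gid."""
    return len(set(groups) - {primary_gid})


def worker(login, gid, max_delay_ms, results):
    """Sleep a random delay, call initgroups, report what the process sees."""
    delay_ms = random.randint(0, max_delay_ms)
    time.sleep(delay_ms / 1000.0)
    try:
        os.initgroups(login, gid)  # requires root
        found = secondary_count(os.getgroups(), gid)
    except OSError:
        found = None  # e.g. not running as root
    results.put((os.getpid(), found, delay_ms))


def run(login, gid, expected, nproc, max_delay_ms):
    """Start nproc workers and return how many saw the wrong group count."""
    results = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(
            target=worker, args=(login, gid, max_delay_ms, results)
        )
        for _ in range(nproc)
    ]
    for proc in procs:
        proc.start()
    failed = 0
    for _ in procs:
        pid, found, delay_ms = results.get()
        if found != expected:
            failed += 1
            print(
                "wrong number of secondary groups in process %d : "
                "%s instead of %d (sleep %dms)" % (pid, found, expected, delay_ms)
            )
    for proc in procs:
        proc.join()
    print("%d/%d failed" % (failed, nproc))
    return failed
```

With the values from the usage above, the call would be run("jbdenis", 110, 5, 24, 200), made as root after the sssd state has been cleaned up.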
I've got good results with 24 processes and a randomized delay of up to 200ms before the initgroups call. Those parameters probably depend on the machine you're running the script on. You may have to run this test multiple times before triggering the bug.
I'm unable to reproduce the bug when I use 0 delay and I think that's why we could reproduce it with our initial test case.
I really hope that you can reproduce the bug on your side.
Thank you for your help,
I tried to reproduce the bug with your script but I was not successful.
Domain section from sssd.conf:
[domain/refLDAP]
id_provider = ldap
auth_provider = ldap
debug_level = 0xFFF0
ldap_uri = ldap://172.17.0.1
ldap_search_base = dc=example,dc=com
ldap_schema = rfc2307bis
ldap_group_object_class = groupOfNames
timeout = 600
ldap_pwd_policy = shadow
I tried different values for the number of processes and the maximum delay in milliseconds: {1..12}x{50ms..300ms/step 10ms}
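A sweep over that grid can be generated roughly as follows. This is only a sketch: sweep_commands is a hypothetical helper, and the login, gid and group count are placeholders taken from the usage example quoted above, not the values actually used in these runs. It only builds the command lines; the sssd state still has to be cleaned up before each run, as described in the quoted instructions.

```python
import itertools


def sweep_commands(login, gid, expected, script="initgroups.py"):
    """Build one command line per (process count, max delay) combination."""
    nprocs = range(1, 13)        # 1..12 processes
    delays = range(50, 301, 10)  # 50ms..300ms in steps of 10ms
    return [
        ["python", script, login, str(gid), str(expected), str(n), str(d)]
        for n, d in itertools.product(nprocs, delays)
    ]


# Show the first few combinations of the grid.
for cmd in sweep_commands("jbdenis", 110, 5)[:3]:
    print(" ".join(cmd))
```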
My laptop has 4 cores: "Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz"
There has to be something different in my configuration. Could you provide more information on how to reproduce it?
LS