On Mon, Apr 13, 2015 at 11:24:22AM +0200, Jakub Hrozek wrote:
> Thank you for the logs. Can you tell which was the first failing request?
It's hard to tell because, by definition, slurm tasks run in parallel.
In fill_initgr() we added the following two debug lines:

    DEBUG(SSSDBG_TRACE_FUNC, "XXXX Retrieve %d groups\n", num-1);
    /* skip first entry, it's the user entry */
    for (i = 0; i < num; i++) {
        gid = sss_view_ldb_msg_find_attr_as_uint64(dom, res->msgs[i + 1],
                                                   SYSDB_GIDNUM, 0);
        posix = ldb_msg_find_attr_as_string(res->msgs[i + 1],
                                            SYSDB_POSIX, NULL);
        DEBUG(SSSDBG_TRACE_FUNC, "XXXX lookup entry %d/%d %d\n", i, num-1, gid);
        if (!gid) {
Most of the time, num matches the expected number of supplementary groups, but sometimes its value is 0 (we guess this is when our problem occurs).
How can we easily print the number of groups at each component/stage of the request/answer flow (backend, responder, cache, memcache) to narrow the search?
Besides, can you elaborate on the so-called fake groups and ghost users? Our understanding is that fake groups are incomplete group entries put in the cache to reduce the load on the backend server, and that ghost users are group attributes meant to avoid creating fake users as group members. Does it make sense to look in the direction of the fake groups to understand our problem?
In particular (and if this understanding is correct): is the responder aware that it is reading a fake group? Which component's job is it to fully resolve the fake group (the one that put it in the cache or the one that needs the info)?
Thanks