On Tue, Jan 26, 2016 at 3:03 PM, Jakub Hrozek <jhrozek@redhat.com> wrote:
On Tue, Jan 26, 2016 at 02:19:42PM -0500, James Ralston wrote:
Here's the problem: unless the user/group objects already happen to be in sssd's cache, enumerating the passwd/group entries in this way is very slow: 3-5 entries per second, at best. For a larger AD domain, the program can take 10-15 minutes to perform this iterative enumeration, which is much longer than we'd prefer.
Can anyone think of a way to make this iterative enumeration go faster?
Did you try mounting the cache to tmpfs to get rid of the cache writes?
[...]
That's… a very clever idea.
From testing using tmpfs to back /var/lib/sss/db, the speed of lookups increases by about an order of magnitude: about 44 lookups per second, instead of 4-5 lookups per second. We have around 5,000 AD objects, so the ~100 second wait would be tolerable.
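For reference, the tmpfs setup is roughly the following (a sketch only; the size is arbitrary, and remember the cache starts out empty after every reboot):

    # stop sssd so it isn't writing to the cache while we swap it out
    systemctl stop sssd

    # mount a tmpfs over the cache directory; the old on-disk files
    # are hidden (not deleted) underneath the mount
    mount -t tmpfs -o size=512m,mode=0700 tmpfs /var/lib/sss/db

    # sssd recreates its ldb files in the now RAM-backed directory
    systemctl start sssd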
A related question: is there any possibility of adding an option to the ad backend to disable the filtering of distribution groups (group type flag 0x8)?
It's a long story, but what we are trying to do here is to take regular snapshots of our AD users and groups, and sssd's getpwnam()/getgrnam() mapping is the perfect way to do it. I think I understand why distribution groups are filtered by default (they're not security-enabled in AD, and can't be used in Windows ACLs), but in this one particular case, we really do want to be able to enumerate every single group.
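To make the access pattern concrete, the enumeration amounts to a loop like this (a simplified sketch, not our actual program; the server and base DN are placeholders, and authentication options are omitted):

    # pull the account names out of AD, then resolve each one through
    # sssd, which is what actually populates the cache
    ldapsearch -LLL -H ldap://dc.example.com -b 'dc=example,dc=com' \
        '(objectClass=user)' sAMAccountName |
    awk '/^sAMAccountName:/ { print $2 }' |
    while read -r name; do
        getent passwd "$name"    # resolves via NSS -> sssd
    done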
On Tue, Jan 26, 2016 at 05:50:06PM -0500, James Ralston wrote:
[...]

I'm glad it helped. FWIW, we're considering adding a nosync option to the cache as well at some point, which should have the same performance effect as using tmpfs, except the cache would be persistent. (On the other hand, if sssd were killed mid-transaction, the cache might end up corrupt, which is why we always sync by default.)

A related question: is there any possibility of adding an option to the ad backend to disable the filtering of distribution groups (group type flag 0x8)?
It's a long story, but what we are trying to do here is to take regular snapshots of our AD users and groups, and sssd's getpwnam()/getgrnam() mapping is the perfect way to do it. I think I understand why distribution groups are filtered by default (they're not security-enabled in AD, and can't be used in Windows ACLs), but in this one particular case, we really do want to be able to enumerate every single group.
Can you try setting ldap_group_type = nosuchattr?

That should trick sssd into not seeing the group type at all, which I guess would avoid the filtering (not tested).
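That is, something like this in the AD domain section of sssd.conf (the domain name is just an example; again, untested):

    [domain/example.com]
    id_provider = ad
    # point the group-type attribute at something that doesn't exist, so
    # the groupType-based filtering of distribution groups never matches
    ldap_group_type = nosuchattr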
On Wed, 27 Jan 2016, Jakub Hrozek wrote:
I'm glad it helped. FWIW, we're considering adding a nosync option to the cache as well at some point [...]
Sounds like a great option to add. If you can't sanity-check the cache, perhaps just deleting it whenever you don't know that it was cleanly written would work?
jh
On Wed, Jan 27, 2016 at 09:43:21AM +0000, John Hodrien wrote:
[...]

If you can't sanity-check the cache, perhaps just deleting it whenever you don't know that it was cleanly written would work?
Yes, that might be one idea: write some 'canary' on shutdown and start fresh if the canary is not there.
Coming up in 1.14..
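Conceptually something like this shell sketch (an illustration of the idea only, not what the implementation would look like; the canary file name is made up):

    # at clean shutdown: leave the canary behind
    touch /var/lib/sss/db/.clean_shutdown

    # at startup: a missing canary means we may have died mid-write
    if [ ! -e /var/lib/sss/db/.clean_shutdown ]; then
        rm -f /var/lib/sss/db/cache_*.ldb   # discard the possibly-corrupt cache
    fi
    rm -f /var/lib/sss/db/.clean_shutdown   # remove it while sssd runs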
On 01/27/2016 05:27 AM, Jakub Hrozek wrote:
[...]

Yes, that might be one idea: write some 'canary' on shutdown and start fresh if the canary is not there.
Coming up in 1.14..
The problem is that when the cache is deleted, it's not just the remote data that is lost. The cache also holds the cached credentials, so it would break in the following common scenario:
Road warrior is out at a customer site and forgets their power cable. By unfortunate chance, the battery runs out while the cache is being written to (for any of a thousand reasons). Once the machine is plugged back in and powered on, SSSD starts up and sees that the cache canary is missing, so it deletes the cache and starts anew. Now the user cannot log in because their cached credentials are no longer there (and since they're sitting in a hotel somewhere far from a direct hookup to their company network, they can't get back in).
This... would not be a good thing.
Now, I can certainly see an argument for having such a nosync (or deferred sync) option for machines that are expected to always be connected to the identity network (and as such are using SSSD mostly for performance and surviving the occasional outage hiccup). But I'd say that such an option, if added, should be VERY carefully documented to explain all of the things that could go wrong.
As an aside, there are plenty of other things that can go wrong when the cache is deleted, including manual overrides from the sss_override command as well as ID ranges if any of them had hash collisions or were using the autorid compat mode.
On Wed, 27 Jan 2016, Stephen Gallagher wrote:
Now, I can certainly see an argument for having such a nosync (or deferred sync) option for machines that are expected to always be connected to the identity network [...]
I don't disagree with what you've said, and it's exactly this situation I'd be interested in. If they're a road warrior, I'd be much less likely to enable nosync.
As an aside, there are plenty of other things that can go wrong when the cache is deleted [...]
Sure, but none of this ends up being worse than using tmpfs, which we currently resort to in order to get acceptable performance. nosync with canary sounds like it can only be better in my situation.
jh
On 01/27/2016 09:21 AM, John Hodrien wrote:
[...]
Sure, but none of this ends up being worse than using tmpfs, which we currently resort to in order to get acceptable performance. nosync with canary sounds like it can only be better in my situation.
Actually, there is one slight difference: tmpfs won't persist across a reboot, but with the nosync-and-canary, it's possible that the cache could be destroyed during an SSSD package upgrade (for example).
Let's say that we introduced a bug and the canary doesn't get written in all cases (maybe we have a crash-on-shutdown bug somewhere). If you do a `yum|dnf update sssd`, this will restart SSSD as part of the process, to ensure that you are running the latest bits. If we crash during the shutdown, this restart might delete the cache unexpectedly.
On Wed, Jan 27, 2016 at 09:17:09AM -0500, Stephen Gallagher wrote:
[...]

The problem is that when the cache is deleted, it's not just the remote data that is lost. The cache also holds the cached credentials. [...]
Yes, but the majority of users who require this speed-up require it for use on the IPA server itself, for example.
[...]

But I'd say that such an option, if added, should be VERY carefully documented to explain all of the things that could go wrong.
Sure.
By the way, the other thing we've been talking about is to only write the entry when it actually changes. Most of the time, when we refresh an entry from the server, nothing changes. The idea would be to write only the dataExpireTimestamp and the other stamps to a separate ldb file that would be in nosync mode, and to do a full synchronous write of the data only when something actually changes. That way, if we lose the nosync database, we only lose timestamps.
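In shell terms, every refresh would make roughly this decision (purely a conceptual sketch with made-up file names, DN, and LDIF files; not how sysdb is really written):

    CACHE=/var/lib/sss/db/cache_example.com.ldb        # full data, synchronous
    STAMPS=/var/lib/sss/db/timestamps_example.com.ldb  # nosync, expendable

    if cmp -s fetched.ldif cached.ldif; then
        # entry unchanged: only bump the expiry stamp in the nosync db;
        # 5400s is the default entry_cache_timeout
        printf '%s\n' \
            'dn: name=someuser,cn=users,cn=example.com,cn=sysdb' \
            'changetype: modify' \
            'replace: dataExpireTimestamp' \
            "dataExpireTimestamp: $(( $(date +%s) + 5400 ))" |
        ldbmodify -H "$STAMPS"
    else
        # entry actually changed: full synchronous write of the data
        ldbmodify -H "$CACHE" < changes.ldif
    fi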
On Wed, Jan 27, 2016 at 10:24 AM, Jakub Hrozek <jhrozek@redhat.com> wrote:
By the way, the other thing we've been talking about is to only write the entry when it actually changes. [...] That way, if we lose the nosync database, we only lose timestamps.
FWIW, I think this is the best solution: it would greatly accelerate the vast majority of lookups (all lookups except the initial one, or when an entry changes), but it would sidestep the cache coherency complexity that would result if syncs were disabled.
In contrast, adding an option to adjust the sync behavior would get sssd into the business of ensuring cache consistency, which is very difficult to get right. (Filesystem and database designers spend a lot of time addressing these issues.)