Hi everyone, we are running a small 389 DS cluster on two RHEL 7.4 machines. The version installed is the most recent in the Red Hat repositories, 1.3.6.1-19.el7_4. 389 DS is used as user storage for the Keycloak single sign-on system. It contains about 150k person objects.
To test the whole system, we run load tests each night. These tests log in 100 users per second in Keycloak for 15 minutes, which in turn authenticates the users against 389 DS. On our machines, this normally results in a very low CPU load from 389 DS, about 10-25%.
Up to now we used SSHA512 as the password hashing algorithm. We would now like to switch to PBKDF2. As a first test, we changed the password of the user that Keycloak uses to bind to 389 DS to PBKDF2 hashing. In this configuration we encountered a problem: when running the load tests, the system behaves normally for the first few minutes. After this, 389 DS CPU usage suddenly jumps to almost 800% on one of the servers (the machines have 8 CPUs) and authentications become very slow. This continues for the remaining runtime of the load test. When running the test again, 389 DS again behaves normally for the first few minutes, then CPU usage jumps to 800%.
When changing the password hash back to SSHA512, everything is fine again.
To me this looks like a bug in 389 DS. Please let me know what information to provide so you can investigate.
Thanks, Marian
On Thu, 2017-09-28 at 08:58 +0000, Marian Rainer-Harbach wrote:
Hey mate,
It is and is not a bug at the same time.
The root issue is confidential (I don't know why; I don't think it needs to be):
https://bugzilla.redhat.com/show_bug.cgi?id=1439272
Because we use the NSS crypto provider, we are affected by this. The tl;dr is that NSS uses the wrong PBKDF2 algorithm, so it's about twice as slow as OpenSSL.
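If you want to get a feel for what one PBKDF2 computation costs, here is a rough timing sketch using Python's hashlib rather than NSS, so absolute numbers will differ from what 389 DS sees; it just shows that per-hash cost scales with the round count (the 10,000 figure matches the baseline discussed below):

```python
# Rough timing sketch (NOT 389 DS code): measure the average wall-clock
# cost of one PBKDF2-SHA256 computation at a given round count.
import hashlib
import time

def time_pbkdf2(rounds, iterations=5):
    """Return average seconds per PBKDF2-SHA256 hash at `rounds` rounds."""
    password = b"secret"
    salt = b"0123456789abcdef"
    start = time.perf_counter()
    for _ in range(iterations):
        hashlib.pbkdf2_hmac("sha256", password, salt, rounds)
    return (time.perf_counter() - start) / iterations

per_hash = time_pbkdf2(10_000)
print(f"~{per_hash * 1000:.1f} ms per hash at 10,000 rounds")
```

Run at a few different round counts to see the linear scaling; a provider that is twice as slow effectively doubles every number you measure here.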
We also set a time factor in the pbkdf2 code. You can see that here:
https://pagure.io/389-ds-base/blob/master/f/ldap/servers/plugins/pwdstorage/...
The summary is we want an attacker to have to spend a minimal amount of time to attempt a hash, in this case: https://pagure.io/389-ds-base/blob/master/f/ldap/servers/plugins/pwdstorage/...
Finally, we have a floor that says "if we think the calculated value is too low, use this baseline"; here it's 10,000.
So what that amounts to is:

* Either our time factor is too high, and that causes your high CPU and latency, or
* Your machine at startup calculated, say, 3,000 rounds for the time cap, but then pushed it up to 10,000, and this is amplified by the NSS implementation bug.
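The calibrate-then-floor logic described above can be sketched like this. This is an illustrative model, not the actual 389 DS code: the benchmark parameters are assumptions, and the ~40 ms target is the work-factor figure discussed later in this thread.

```python
# Illustrative sketch of startup round calibration with a minimum floor.
# Parameters are assumptions for illustration, not 389 DS's real values.
import hashlib
import time

TARGET_SECONDS = 0.040   # desired minimum attacker cost per password guess
MIN_ROUNDS = 10_000      # baseline floor: never go below this
BENCH_ROUNDS = 2_000     # round count used for the one-off benchmark

def calibrate_rounds():
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", b"bench", b"salt" * 4, BENCH_ROUNDS)
    elapsed = time.perf_counter() - start
    # PBKDF2 cost is linear in rounds: scale so one hash takes ~TARGET_SECONDS.
    estimated = int(BENCH_ROUNDS * TARGET_SECONDS / elapsed)
    # On a fast machine the estimate may come out low (e.g. ~3,000); the
    # floor then pushes it back up to 10,000, which is the amplification
    # described in the second bullet above.
    return max(estimated, MIN_ROUNDS)

print("calibrated rounds:", calibrate_rounds())
```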
I would be happy to open a bug about this. I think that perhaps the issue is that our minimum is too high, and that our time factor is too high.
https://pagure.io/389-ds-base/issue/49387
Hope that helps,
Hi William, thanks for the explanation and links to the code!
However, I'm not sure if this is really related to the problem I'm seeing. As I tried to explain above, the first few minutes of a load test run fine, but then (after a variable time) there is a sudden jump from <25% CPU to almost 800% CPU.
If I understood the calculation of the number of rounds in the code correctly, wouldn't the behavior of 389 DS be consistent across the whole load test? I.e., wouldn't the high CPU usage happen right from the start of a test run?
Best regards, Marian
On Mon, 2017-10-02 at 09:56 +0000, Marian Rainer-Harbach wrote:
This was my thinking also, but you said the problem *only* occurred with pbkdf2, which means this must be the source of the issue.
The way the rounds are calculated on startup likely means you ended up with 10,000 rounds (the minimum), which would exceed the 40ms work factor we aim for. As a result, each bind starts to take, say, 100ms. Initially things look "quick", but as more clients connect, you no longer have enough CPU to service them all: bind times elongate, more clients queue up, and you saturate the CPU completely.
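A back-of-envelope check of that saturation argument, using the numbers from this thread (100 binds/s from the load test, an assumed ~100 ms of CPU-bound PBKDF2 work per bind, 8 CPUs):

```python
# Offered-load arithmetic for the queueing argument. All numbers are
# taken from (or assumed in) this thread, not measured values.
arrival_rate = 100      # binds per second from the load test
cpu_per_bind = 0.100    # assumed seconds of CPU-bound PBKDF2 work per bind
cpus = 8

offered_load = arrival_rate * cpu_per_bind   # CPU-seconds needed per second
utilization = offered_load / cpus

print(f"offered load: {offered_load:.1f} CPU-seconds/s on {cpus} CPUs")
print(f"utilization: {utilization:.0%}")
# Utilization above 100% means the queue grows without bound: the server
# keeps up briefly, then latency climbs and all CPUs pin, matching the
# "fine for a few minutes, then ~800% CPU" symptom.
```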
If you suspect the issue is elsewhere, I'm going to ask you to take a pstack of the process during the high load and send it to me to analyse, so I can see where the server is spending its time.
Hope that helps,
Hi William,
I'm sorry it took me so long to get back to you! I get what you're saying and it seems like a possible explanation for some scenarios.
However, I'm still not sure it applies to my case, where 389 DS behaves completely normally for a few minutes before an instantaneous jump to full CPU usage occurs.
I created three pstack traces:

1. https://pastebin.com/iBCbmpuk was taken while the load test was already running and 389 DS still behaved normally.
2. https://pastebin.com/aVbCxjeT is one minute later. The same load test is still running, but now 389 DS uses all CPUs and is very slow.
3. https://pastebin.com/Bsv0uefY is after stopping the load test; 389 DS is now idle.
It would be great if you could take a look at those traces!
Thanks again, Marian
On Thu, 2017-11-02 at 08:57 +0000, Marian Rainer-Harbach wrote:
Hey there,
Thread-wise, I can see that the time spent in PBKDF2 is likely your issue. You are not consuming all workers though; as you can see, many threads are idle in pthread_cond_wait/connection_threadmain.
However, I still think the initial assessment was correct. Have you tried out the latest version with the pbkdf2 timing changes applied?
Hope that helps!
Hi William,
thanks for looking at the stack traces! I upgraded to the latest version available in RHEL 7.4 (1.3.6.1-21.el7_4), this didn't change the behavior. I guess that the changes you're talking about aren't yet available in RHEL.
So I think that we'll have to remain on SSHA512 hashing for now and try PBKDF2 again sometime later.
Best regards, Marian
On Thu, 2017-11-02 at 13:07 +0000, Marian Rainer-Harbach wrote:
The changes should be present in 1.3.7 of 389-ds-base, aka a future version of the product.
Thanks for trying this out!
Best regards, Marian

_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org