First off, apologies for double posting here and to the ipa mailing list, but we are getting a bit uneasy; also, the issue seems to come from the code in 389-ds directly, so this seems more appropriate.
We are using ipa-server-4.6.5-11.el7.centos.3.x86_64 with 389-ds-base-1.3.9.1-10.el7.x86_64 on CentOS 7.7. For the past couple of days, some of our replicas have been logging "csngen_new_csn - Sequence rollover; local offset updated." messages in the slapd error logs.
We use the "ipa_check_consistency" Python script, and replication seems to be fine.
We checked all replicas, and they are all in time sync with NTP (updated) with no visible offset.
Is this anything to worry about, and how can we make these messages stop appearing?
On 22 Dec 2019, at 08:22, Christophe Trefois <trefex@gmail.com> wrote:
First off, apologies for double posting here and to the ipa mailing list, but we are getting a bit uneasy; also, the issue seems to come from the code in 389-ds directly, so this seems more appropriate.
Hi there, thanks for contacting us. Happy for you to post here.
We are using ipa-server-4.6.5-11.el7.centos.3.x86_64 with 389-ds-base-1.3.9.1-10.el7.x86_64 on CentOS 7.7. For the past couple of days, some of our replicas have been logging "csngen_new_csn - Sequence rollover; local offset updated." messages in the slapd error logs.
This isn't a problem, but you should investigate the possible causes. The short answer is that we are pushing the Lamport clock ahead due to either a high write rate or the system clock being stepped backwards.
To see the code, look at:
https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/csngen.c#_195
For sanity checking, you should probably investigate:
* If you have unexpectedly high write load in your environment
* If you have issues with NTP consistency on your machines (the clock continually advancing or reversing)
* If a virtualised time sync service (VMware/libvirt) is conflicting with NTP and causing time jumps
For a slightly longer explanation: the CSN is a Lamport clock, i.e. it can only advance, never step back. It's based on the current Unix time in seconds, with a 16-bit sub-counter, i.e. we can have 65535 writes "per second".
This is because if you have, say:

1. Write object A
2. NTP syncs the clock backwards
3. Write object B
We need the CSNs of these to still reflect the true order of operations: that A occurred before B. We use time as the sync source between replicas rather than locking/consensus, so if the CSN didn't use a Lamport clock, the changelog would show B before A, which is incorrect for reasons that are extremely complex and subtle.
So with the CSN being a Lamport clock, if NTP sets your time backwards, the CSN stays at the "highest" time and the sub-counter keeps incrementing. If this continues for a long time, we overflow the 16-bit sub-counter; we can't have duplicate CSNs, so the local offset (in seconds) is increased to push the CSNs ever forward.
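The mechanics above can be sketched in a few lines of Python. This is a toy model, not the actual C implementation in csngen.c; the class and field names are illustrative only.

```python
import time

class CsnGen:
    """Toy Lamport-style CSN generator: (seconds, 16-bit sub-counter)."""
    SEQ_MAX = 0xFFFF

    def __init__(self, now=time.time):
        self.now = now                   # injectable clock, for testing
        self.time = int(self.now()) - 1  # Lamport time (sampled time + offset)
        self.seq = 0

    def new_csn(self):
        cur = int(self.now())
        if cur > self.time:
            # System clock is ahead of the last CSN: adopt real time.
            self.time = cur
            self.seq = 0
        elif self.seq < self.SEQ_MAX:
            # Same second, or the clock stepped backwards: only the
            # sub-counter moves, so CSNs never go backwards.
            self.seq += 1
        else:
            # 16-bit counter exhausted: this is the "Sequence rollover;
            # local offset updated" case. The CSN is pushed a second ahead.
            self.time += 1
            self.seq = 0
        return (self.time, self.seq)
```

With a clock pinned at t=100, two writes yield (100, 0) and (100, 1); if the clock then steps back to t=95, the next write still gets (100, 2) rather than moving backwards.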
That's why I recommend you check your write load and NTP/system time.
Hope that helps,
We use the "ipa_check_consistency" Python script, and replication seems to be fine.
We checked all replicas, and they are all in time sync with NTP (updated) with no visible offset.
Is this anything to worry about, and how can we make these messages stop appearing?
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject....
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs
Dear @William
Thank you a lot for this nice answer and explanations.
We did update and harmonise the NTP server across all 8 FreeIPA servers that we have, and continued to see a lot of these errors over the next couple of weeks.
However, as of 10 days ago, all messages of this sort have stopped and have not returned.
Is there some catch-up mechanism that somehow was able to "heal" this?
Thanks a lot for your insights, Christophe
On 6 Feb 2020, at 10:09, Christophe Trefois <trefex@gmail.com> wrote:
Dear @William
Thank you a lot for this nice answer and explanations.
We did update and harmonise the NTP server across all 8 FreeIPA servers that we have, and continued to see a lot of these errors over the next couple of weeks.
However, as of 10 days ago, all messages of this sort have stopped and have not returned.
Is there some catch-up mechanism that somehow was able to "heal" this?
Yes, there is a clock skew stored on the servers, but over time, if NTP is now correct, it "drifts back" to reality.
Think of it as the CSN Lamport clock being "in the future": it always has to advance, but if it advances just a bit "slowly", eventually the NTP/system clocks will catch up, and things get back to normal.
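That catch-up can be seen in a toy timeline (purely illustrative Python, with a made-up `csn_times` helper): the CSN second is effectively pinned at max(csn_time, system_time), so once the correctly ticking system clock overtakes the pinned value, CSNs track real time again.

```python
def csn_times(system_times, start_csn_time):
    """Return the CSN second that would be issued at each system-clock tick."""
    csn_time = start_csn_time
    out = []
    for t in system_times:
        csn_time = max(csn_time, t)  # the Lamport clock never steps back
        out.append(csn_time)
    return out

# The CSN was pushed 3 seconds into the future by rollovers; the system
# clock now ticks normally from t=100.
print(csn_times([100, 101, 102, 103, 104, 105], start_csn_time=103))
# → [103, 103, 103, 103, 104, 105]
```

The CSN sits at 103 while the real clock catches up, then tracks real time again from t=104 onward: the accumulated offset has been consumed.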
Does that make sense?
Happy to have helped, and glad the issue has been resolved!
Thanks a lot for your insights, Christophe
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs
Hi Christophe,
Like William said, the message "csngen_new_csn - Sequence rollover; local offset updated." is not alarming and IMHO is likely related to the NTP update.
The sequence number (seq_num) is a sub-second counter increased for each write operation. It can also jump ahead to the seq_num value of a received replicated update; in short, if a replica timestamps an update with a high seq_num, that high seq_num will be propagated to the other replicas. It seems unrealistic to process 65K updates in the same second, so the only thing I can think of is a jump of the system time (ahead and/or backward) on one of the replicas. This jump can be transient.
If such a jump occurred, the next CSNs will be timestamped with the same second, but that timestamp will last longer than an actual second. In that case we can process many writes, all timestamped in the same second, increasing only the sequence number.
The NTP update likely triggered a jump. This jump was consumed by the regular clock ticking. During this period, seq_num rolled over several times. Then, once the jump was fully consumed, seq_num reset at each change of second and no longer rolled over.
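The way a high seq_num propagates between replicas can be sketched as follows. This is a hypothetical simplification in Python, not the actual 389-ds time-adjustment code; `adjust_to_remote` and the (time, seq) tuple representation are my own names.

```python
def adjust_to_remote(local, remote):
    """Fast-forward the local (time, seq) state past a replicated CSN,
    so the next locally issued CSN sorts after everything seen so far."""
    if remote >= local:        # tuples compare lexicographically: time, then seq
        t, s = remote
        return (t, s + 1)      # in a real generator this could itself roll over
    return local               # remote is older: local state is unchanged

# A replica at (100, 5) receives a replicated update stamped (100, 9000):
# its next CSN must sort after (100, 9000), so the high seq_num spreads.
print(adjust_to_remote((100, 5), (100, 9000)))
# → (100, 9001)
```

This is why one replica with a transient time jump can push high seq_num values into the whole topology, even though no single server is processing 65K writes per second.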
Best regards,
Thierry

On 2/6/20 1:09 AM, Christophe Trefois wrote:
Dear @William
Thank you a lot for this nice answer and explanations.
We did update and harmonise the NTP server across all 8 FreeIPA servers that we have, and continued to see a lot of these errors over the next couple of weeks.
However, as of 10 days ago, all messages of this sort have stopped and have not returned.
Is there some catch-up mechanism that somehow was able to "heal" this?
Thanks a lot for your insights, Christophe