Hello gurus,
We have been running a 3-node FreeIPA cluster for some time without major trouble. One server may stall from time to time, but restarting it has never been a real problem.
A few days ago, we had to migrate the VMs between two clouds (disk images copied from one to the other). They were renumbered from the old to the new IPv4 address space. It was not that easy, but we finally got it done with all DNS entries in sync. Yet since the migration, the ns-slapd process hangs randomly far more often than before (from once every few months to several times a day) and is especially hard to restart on any node.
While starting up, the netstat output is like:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State       PID/Program name
tcp6  184527      0 10.217.151.3:389     10.217.151.2:52314   ESTABLISHED 29948/ns-slapd
Netstat and tcpdump show that it drains the Recv-Q very slowly (sometimes around 79 bytes every 1-2 seconds). At some point it stops draining it altogether and hangs (only kill -9 takes it down). When it is stalled, strace shows the process looping only on:
getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport endpoint is not connected)
poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN}, {fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN}, {fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN}, {fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN}, {fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN}, {fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN}, {fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
If it gets through startup replication, one of the servers hangs a little later and freezes the whole cluster, forcing us to restart the faulty node to unblock things.
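For anyone who wants to watch the receive-queue backlog over time rather than eyeball netstat, here is a minimal sketch that parses netstat-style text and lists TCP sockets whose Recv-Q exceeds a threshold. The function name `find_backlogged`, the `Backlog` tuple and the 1000-byte threshold are illustrative assumptions, not part of FreeIPA or 389-ds; the sample rows are the ones quoted in this message.

```python
from typing import List, NamedTuple


class Backlog(NamedTuple):
    local: str
    peer: str
    recv_q: int
    state: str


def find_backlogged(netstat_text: str, min_recv_q: int = 1) -> List[Backlog]:
    """Return TCP sockets whose Recv-Q is at least min_recv_q bytes."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            hits.append(Backlog(fields[3], fields[4], recv_q, fields[5]))
    return hits


# Rows taken from the netstat output quoted in this message.
sample = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 184527 0 10.217.151.3:389 10.217.151.2:52314 ESTABLISHED 29948/ns-slapd
tcp6 286 0 10.217.151.4:389 10.217.151.10:32512 ESTABLISHED 29948/ns-slapd
"""

# The 1000-byte threshold is an arbitrary choice for illustration.
for entry in find_backlogged(sample, min_recv_q=1000):
    print(entry.local, entry.peer, entry.recv_q, entry.state)
```

Feeding it the captured output of `netstat -tnp` every few seconds would show whether the backlog is draining or frozen.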
When stalled, the dirsrv access log only contains entries like:
[20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection from 10.217.151.2 to 10.217.151.4
[20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection from 10.217.151.3 to 10.217.151.4
[20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection from 10.217.151.4 to 10.217.151.4
And netstat lists tens of accepted network connections that are stuck, like:
tcp6 286 0 10.217.151.4:389 10.217.151.10:32512 ESTABLISHED 29948/ns-slapd
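The access log pattern above (connections accepted, but no operation ever logged on them) can be checked mechanically. The following is a rough sketch, assuming the usual 389-ds access log layout in which operations appear as `conn=N op=M ...` lines; the function name `idle_connections` is made up for illustration, and the sample lines are taken from the log excerpt in this message.

```python
import re
from collections import Counter

# Matches the "connection from X to Y" lines shown in the access log.
CONN_RE = re.compile(
    r"conn=(?P<conn>\d+) fd=\d+ slot=\d+ (?:SSL )?connection from "
    r"(?P<src>\S+) to (?P<dst>\S+)"
)
# Assumption: any later activity on a connection is logged as "conn=N op=M".
OP_RE = re.compile(r"conn=(?P<conn>\d+) op=-?\d+")


def idle_connections(log_text: str) -> Counter:
    """Count, per client address, connections that never logged an operation."""
    opened = {}      # conn id -> client address
    active = set()   # conn ids that logged at least one operation
    for line in log_text.splitlines():
        match = CONN_RE.search(line)
        if match:
            opened[int(match.group("conn"))] = match.group("src")
            continue
        match = OP_RE.search(line)
        if match:
            active.add(int(match.group("conn")))
    return Counter(src for conn, src in opened.items() if conn not in active)


# Lines copied from the access log excerpt in this message.
sample_log = """\
[20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection from 10.217.151.2 to 10.217.151.4
"""

for client, count in idle_connections(sample_log).most_common():
    print(client, count)
```

A growing count for every client, including the server's own address, would be consistent with the server accepting connections but no longer servicing them.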
The underlying network seems clean and uses jumbo frames. tcpdump and ping show no packet loss and no retransmits. Fearing a jumbo frame issue, we even forced the MTU down to 1500, without success.
Entropy seems fine as well:
# cat /proc/sys/kernel/random/entropy_avail
3138
Running version on all servers:
ipa-client-4.6.5-11.el7.centos.x86_64
ipa-client-common-4.6.5-11.el7.centos.noarch
ipa-common-4.6.5-11.el7.centos.noarch
ipa-server-4.6.5-11.el7.centos.x86_64
ipa-server-common-4.6.5-11.el7.centos.noarch
ipa-server-dns-4.6.5-11.el7.centos.noarch
I'd happily listen to any hint regarding this critical problem.
/Sylvain.
Hi Sylvain,
I believe we had a similar issue in our configuration. I can dig in more tomorrow, but we had deadlocks with the retroCL plugin.
If you follow the steps outlined on this page, https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html#debug_hangs, to get a stack trace, I can try to see whether you're hitting the same thing.
See https://bugzilla.redhat.com/show_bug.cgi?id=1751295 for some more details on the issue.
Hope that helps, Jared
On Sun, Oct 20, 2019, at 3:55 PM, Sylvain Coutant via FreeIPA-users wrote: