(Keeping the freeipa-users list included.)
Hi Sylvain,
Ah, shucks. For the debuginfo packages, you might be able to leverage the packages available on https://dl.fedoraproject.org/pub/epel/, but someone else on this list might know more.
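One other thing worth a try on CentOS, if I remember correctly: the 389-ds-base debuginfo tends to live in the CentOS debuginfo repo rather than EPEL, and debuginfo-install (from yum-utils) knows where to look. This is an assumption on my part, so verify against your repos:

```shell
# Sketch, assuming a stock CentOS 7 box: debuginfo-install resolves the
# matching -debuginfo package from the CentOS debuginfo repository.
yum install -y yum-utils
debuginfo-install 389-ds-base
```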
The only other idea that comes to mind is what was discussed on https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahoste...
Specifically, check in your 'cn=config' whether nsslapd-enable-nunc-stans is set to on; if it is, try the ldapmodify from the comment above to disable it. For context, that now defaults to off as a result of https://bugzilla.redhat.com/show_bug.cgi?id=1614501
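As a sketch (the exact command in that thread may differ; this assumes you bind as Directory Manager and restart dirsrv afterwards), the change boils down to an LDIF fragment like:

```ldif
# Sketch: turn nunc-stans off in the server config
dn: cn=config
changetype: modify
replace: nsslapd-enable-nunc-stans
nsslapd-enable-nunc-stans: off
```

fed to something like `ldapmodify -x -D "cn=Directory Manager" -W -H ldap://localhost -f disable-nunc-stans.ldif`.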
Hope this helps, Jared
On Mon, Oct 21, 2019, at 5:31 AM, Sylvain Coutant wrote:
Jared,
Thanks for your message. To be honest, I would have expected this to be a network issue at first. All the symptoms are there. But tcpdump tells me that things are ok ...
For the debuginfo, I already had a look but was unable to find the right packages:
yum search --enablerepo=epel-debuginfo 389-ds
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
- base: centos.mirrors.proxad.net
- epel: mirrors.coreix.net
- epel-debuginfo: mirrors.coreix.net
- extras: centos.mirrors.proxad.net
- updates: centos.mirrors.proxad.net
=============================== N/S matched: 389-ds ===============================
389-dsgw-debuginfo.x86_64 : Debug information for package 389-dsgw
389-ds.noarch : 389 Directory, Administration, and Console Suite
389-ds-base.x86_64 : 389 Directory Server (base)
389-ds-base-devel.x86_64 : Development libraries for 389 Directory Server
389-ds-base-libs.x86_64 : Core libraries for 389 Directory Server
389-ds-base-snmp.x86_64 : SNMP Agent for 389 Directory Server
389-ds-console.noarch : 389 Directory Server Management Console
389-ds-console-doc.noarch : Web docs for 389 Directory Server Management Console
389-dsgw.x86_64 : 389 Directory Server Gateway (dsgw)
There's no 389-ds-base-debuginfo available. I'm probably missing something ...
As always, the cluster hung during the nightly backup, and the node running the backup was dead. After restarting it, I tried to disable retroCL trimming as per https://bugzilla.redhat.com/show_bug.cgi?id=1751295, but still got a hang during resync. I enabled a few more logs in DS, but so far they don't look very helpful, except that the server stops logging anything (including housekeeping) once it stalls.
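For the record, I disabled trimming with an LDIF along these lines (assuming the standard retro changelog plugin entry; trimming is driven by the nsslapd-changelogmaxage attribute, so removing it should stop the trim thread):

```ldif
# Sketch: stop retro changelog trimming by removing its max-age setting
dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
changetype: modify
delete: nsslapd-changelogmaxage
```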
/Sylvain.
Le lun. 21 oct. 2019 à 06:00, Jared Ledvina jared@techsmix.net a écrit :
Hi Sylvain,
I believe we had a similar issue in our configuration. I can dig in more tomorrow but, we had deadlocks with the retroCL plugin.
If you follow the steps outlined on this page, https://directory.fedoraproject.org/docs/389ds/FAQ/faq.html#debug_hangs to get a stack trace, I can try to see if you're hitting the same thing.
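In short, once the debuginfo is in place, that FAQ's procedure amounts to attaching gdb to the running ns-slapd and dumping every thread's stack, roughly like this (paths assume a stock el7 install; see the FAQ for the authoritative steps):

```shell
# Dump a full backtrace of every thread of the running ns-slapd.
# Assumes gdb and the matching 389-ds-base debuginfo are installed.
gdb -ex 'set confirm off' -ex 'set pagination off' \
    -ex 'thread apply all bt full' -ex 'quit' \
    /usr/sbin/ns-slapd "$(pidof ns-slapd)" > stacktrace.$(date +%s).txt 2>&1
```

Grab a couple of these a few seconds apart while it is hung; threads stuck at the same spot across dumps are the interesting ones.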
See https://bugzilla.redhat.com/show_bug.cgi?id=1751295 for some more details on the issue.
Hope that helps, Jared
On Sun, Oct 20, 2019, at 3:55 PM, Sylvain Coutant via FreeIPA-users wrote:
Hello gurus,
We have been running a 3-node FreeIPA cluster for some time without major trouble. One server may stall from time to time, but restarting it is usually no real trouble.
A few days ago, we had to migrate the VMs between two clouds (disk images copied from one to the other). They were renumbered from the old to a new IPv4 address space. Not that easy, but we finally got it done with all DNS entries in sync. Yet since the migration, the ns-slapd process hangs randomly far more often than before (it went from once every few months to several times a day) and is especially hard to restart on any node.
While starting up, the netstat output is like:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address      Foreign Address      State        PID/Program name
tcp6  184527      0 10.217.151.3:389   10.217.151.2:52314   ESTABLISHED  29948/ns-slapd
Netstat and tcpdump show it processes the receive queue very slowly (sometimes around 79 bytes per 1-2 seconds). At some point it just stops processing and hangs (only kill -9 works to take it down). When stalled, strace shows the process loops only on:
getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport endpoint is not connected)
poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN}, {fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN}, {fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN}, {fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN}, {fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN}, {fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN}, {fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
If it does make it through startup replication, one of the servers will hang a little later, freezing the whole cluster and forcing us to restart the faulty node to unlock things.
When stalled, the dirsrv access log only contains entries like:
[20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection from 10.217.151.2 to 10.217.151.4
[20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection from 10.217.151.3 to 10.217.151.4
[20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection from 10.217.151.4 to 10.217.151.4
And netstat lists tens of accepted network connections that are stale, like:
tcp6     286      0 10.217.151.4:389   10.217.151.10:32512  ESTABLISHED  29948/ns-slapd
The underlying network seems clean and uses jumbo frames. tcpdump and ping show zero packet loss and no retransmits. Fearing a jumbo frame issue, we even forced the MTU down to 1500, without success.
Entropy seems fine as well:
# cat /proc/sys/kernel/random/entropy_avail
3138
Running versions on all servers:
ipa-client-4.6.5-11.el7.centos.x86_64
ipa-client-common-4.6.5-11.el7.centos.noarch
ipa-common-4.6.5-11.el7.centos.noarch
ipa-server-4.6.5-11.el7.centos.x86_64
ipa-server-common-4.6.5-11.el7.centos.noarch
ipa-server-dns-4.6.5-11.el7.centos.noarch
I'd happily listen to any hint regarding this critical problem.
/Sylvain.
_______________________________________________
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
To unsubscribe send an email to freeipa-users-leave@lists.fedorahosted.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahoste...