Both 1.15.2 and git master hang after less than 24 hours on a server.
I can see this repeating in the domain log:
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb
Ideas?
Jocke
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
This is caused by the write to disk taking too long.
Ideas?
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
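[Editor's note: a sketch of the two workarounds mentioned above. The domain name is taken from this thread; the tmpfs size and mode are assumptions, not recommendations from the list.]

```ini
; sssd.conf: disable enumeration for the domain
[domain/infinera.com]
enumerate = false
```

```
# /etc/fstab: alternatively, keep the sssd cache on tmpfs
tmpfs  /var/lib/sss/db  tmpfs  size=512M,mode=0700  0 0
```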
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
And never will?
Jocke
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
And never will?
We are doing incremental performance improvements. There is a round planned for the next upstream version. I'm afraid we don't have any patches yet, but Sumit and I have been throwing ideas around, so we already know what to do. But please keep in mind that enumerating a large forest amounts to keeping a local replica, which is going to be costly.
On Sun, 11 Jun 2017, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
I'm an advocate of this method. With older versions of SSSD, against our relatively large AD, the performance boost from running with tmpfs was immense. This advantage has been reducing over time, as a normally configured SSSD's performance has improved greatly in our configuration.
jh
On Mon, 2017-06-12 at 09:19 +0100, John Hodrien wrote:
On Sun, 11 Jun 2017, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
Testing this now. It is a bit strange that even with enumerate = true, the first time I run getent group it pauses for a little while, even if I wait a few minutes for the cache to populate.
Jocke
On Mon, Jun 12, 2017 at 08:29:29AM +0000, Joakim Tjernlund wrote:
Testing this now. It is a bit strange that even with enumerate = true, the first time I run getent group it pauses for a little while, even if I wait a few minutes for the cache to populate.
sssd blocks on the very first enumeration to avoid replying with no or partial results.
On Mon, 2017-06-12 at 10:29 +0200, Joakim Tjernlund wrote:
hmm, aren't "offline" login creds stored here as well? Then a RAM fs will lose the offline creds on each reboot. Is there a way around this?
Jocke
On Mon, 12 Jun 2017, Joakim Tjernlund wrote:
hmm, aren't "offline" login creds stored here as well? Then a RAM fs will lose the offline creds on each reboot. Is there a way around this?
You could sync it elsewhere on shutdown perhaps?
So far we've got away with not using tmpfs on machines that need stored credentials.
jh
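[Editor's note: a minimal sketch of the sync-elsewhere idea, assuming the cache lives on tmpfs. CACHE_DIR, BACKUP_DIR, and the save/restore hook points are assumptions; credentials cached after the last save would still be lost on a crash.]

```shell
#!/bin/sh
# Sketch: persist a tmpfs-backed sssd cache across reboots by copying it
# to disk at shutdown and restoring it at boot (hypothetical paths).
CACHE_DIR=${CACHE_DIR:-/var/lib/sss/db}
BACKUP_DIR=${BACKUP_DIR:-/var/lib/sss/db.persist}

save_cache() {
    # run from a shutdown hook, after sssd has stopped
    mkdir -p "$BACKUP_DIR"
    cp -a "$CACHE_DIR"/. "$BACKUP_DIR"/
}

restore_cache() {
    # run from a boot hook, after the tmpfs is mounted, before sssd starts
    [ -d "$BACKUP_DIR" ] && cp -a "$BACKUP_DIR"/. "$CACHE_DIR"/
}

case "${1:-}" in
    save)    save_cache ;;
    restore) restore_cache ;;
esac
```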
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to be an additional problem here.
I don't think this AD is that big either:

# > getent passwd | wc -l
3236
# > getent group | wc -l
885
Any ideas?
Jocke
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
Can you get a pstack of the process when it is 'stuck'?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
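[Editor's note: the suggestion corresponds to a domain-section option along these lines. My understanding is that `timeout` here is sssd's internal heartbeat timeout, default 10 seconds; the domain name is taken from this thread.]

```ini
; sssd.conf sketch: raise the per-domain heartbeat timeout
[domain/infinera.com]
enumerate = true
timeout = 30
```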
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
both 1.15.2 and git master hangs after less than 24 hour on a server.
I can see this repeating the domain log:
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb
This is caused by too long write to disk.
Can I just increase the timeout for now? I will patch the code if needed. On this sever we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to an additional problem here.
I don't this AD is that big either: # > getent passwd | wc -l 3236 # > getent group | wc -l 885
Any ideas?
Can you get a pstack of the process when it is 'stuck'?
Don't know what pstack is.
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
will try ..
On Mon, Jun 12, 2017 at 03:21:43PM +0000, Joakim Tjernlund wrote:
Don't know what pstack is.
Sorry, it's a utility that prints the backtrace of a process, e.g.:

pstack $(pidof sssd_be)
#0  0x00007f5fa5ae9db3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f5fa61ca8ca in epoll_event_loop (tvalp=0x7ffd78977bf0, epoll_ev=0xb44e70) at ../tevent_epoll.c:642
#2  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:926
#3  0x00007f5fa61c8f0a in std_event_loop_once (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent_standard.c:114
#4  0x00007f5fa61c50e0 in _tevent_loop_once (ev=ev@entry=0xb44c30, location=location@entry=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent.c:533
#5  0x00007f5fa61c527b in tevent_common_loop_wait (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent.c:637
#6  0x00007f5fa61c8e9a in std_event_loop_wait (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent_standard.c:140
#7  0x00007f5faa173f10 in server_loop (main_ctx=0xb46080) at /sssd/src/util/server.c:719
#8  0x00000000004093ff in main (argc=8, argv=0x7ffd78978028) at /sssd/src/providers/data_provider_be.c:589
I don't know about Gentoo, but on RHEL/Fedora, it's part of the gdb package.
On Mon, 2017-06-12 at 17:57 +0200, Jakub Hrozek wrote:
I don't know about Gentoo, but on RHEL/Fedora, it's part of the gdb package.
I see, it's not in native Gentoo but can be found in external overlays. Not sure this will help though, as sssd is burning CPU when it gets into this state.
Jocke
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
Can you get a pstack of when the process is 'stuck' ?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
I see a LOT of this in the log (figured I'd look before restarting sssd):
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [sdap_find_entry_by_origDN] (0x4000): Searching cache for [CN=Jovy\20Sena,OU=Sunnyvale,OU=CorpUsers,DC=infinera,DC=com].
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x4c28c00
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x4c28cc0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x4c28cc0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x34ccf50
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x34cd0c0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x34ccf50 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x34cd0c0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x34ccf50 "ltdb_callback"
On Mon, 2017-06-12 at 17:32 +0200, Joakim Tjernlund wrote:
After just adding timeout = 30 and restarting sssd it still hung. Had to clear out the sssd cache as well (saved a copy first) to restore normal function.
Jocke
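[Editor's note: the save-a-copy-then-clear step can be sketched like this. DB_DIR and the backup naming are assumptions; stop sssd before running it, and note that clearing the cache also discards cached offline credentials.]

```shell
#!/bin/sh
# Sketch: back up, then clear, the sssd cache so sssd can rebuild it.
DB_DIR=${DB_DIR:-/var/lib/sss/db}

backup_and_clear() {
    stamp=$(date +%Y%m%d%H%M%S)
    cp -a "$DB_DIR" "$DB_DIR.bak-$stamp"   # keep a copy for later inspection
    rm -f "$DB_DIR"/cache_*.ldb            # sssd recreates these on startup
}
```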
On Mon, Jun 12, 2017 at 03:38:28PM +0000, Joakim Tjernlund wrote:
On Mon, 2017-06-12 at 17:32 +0200, Joakim Tjernlund wrote:
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote: > On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote: > > both 1.15.2 and git master hangs after less than 24 hour on > > a server. > > > > I can see this repeating the domain log: > > > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > This is caused by too long write to disk. >
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
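For reference, mounting the cache directory on tmpfs could look roughly like this. The path is the one from this thread; the size and mode values are arbitrary examples, not recommendations from the thread:

```
# one-off mount (as root)
mount -t tmpfs -o size=256m,mode=0700 tmpfs /var/lib/sss/db

# or persistently, via an /etc/fstab entry
tmpfs  /var/lib/sss/db  tmpfs  size=256m,mode=0700  0  0
```

Note that a tmpfs cache is lost on reboot, so sssd has to repopulate it from the server on the next start.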
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to be an additional problem here.
I don't think this AD is that big either:
# getent passwd | wc -l
3236
# getent group | wc -l
885
Any ideas?
Can you get a pstack of when the process is 'stuck' ?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
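The suggested change, as a minimal sssd.conf sketch (the domain name is taken from the logs in this thread):

```
[domain/infinera.com]
timeout = 30
```

The option must go in the [domain/...] section, not in [sssd] or elsewhere.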
I see A LOT of this in the log (figured I'd look before I restart sssd):
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [sdap_find_entry_by_origDN] (0x4000): Searching cache for [CN=Jovy\20Sena,OU=Sunnyvale,OU=CorpUsers,DC=infinera,DC=com].
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x4c28c00
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x4c28cc0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x4c28cc0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x34ccf50
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x34cd0c0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x34ccf50 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x34cd0c0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x34ccf50 "ltdb_callback"
After just adding timout = 30 and restarting sssd it still hung. Had to clear out(saved a copy first)
^^^^^^^^^^^
There is a typo here; I wonder if you used the correct spelling in the config? Also, did you add the option to the domain section?
the sssd cache as well for normal function.
Jocke
_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org
To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
On Mon, 2017-06-12 at 17:51 +0200, Jakub Hrozek wrote:
There is a typo here, I wonder if you used the correct spelling in the config? Also, did you add the option to the domain section?
It is now :) was in the wrong section before
On Mon, 2017-06-12 at 18:06 +0200, Joakim Tjernlund wrote:
It is now :) was in the wrong section before
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
Jocke
On Tue, 2017-06-13 at 14:12 +0200, Joakim Tjernlund wrote:
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
However, now I see that getent group / getent group <a-grp-name> is incomplete: members are missing. And it varies between machines; even ones that have enumerate = false have an incomplete member list for a random group name.
Jocke
On Tue, Jun 13, 2017 at 12:34:41PM +0000, Joakim Tjernlund wrote:
Bug whack-a-mole, probably: https://pagure.io/SSSD/sssd/issue/3369. Please check the debug logs for messages from the "cleanup task".
On Tue, 2017-06-13 at 17:59 +0200, Jakub Hrozek wrote:
Nothing in the logs, what debug level do I need to see this?
On Tue, Jun 13, 2017 at 06:18:24PM +0000, Joakim Tjernlund wrote:
Nothing in the logs, what debug level do I need to see this?
5 or higher.
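For illustration, the corresponding sssd.conf change might be sketched like this (domain name taken from this thread; any level of 5 or higher should show the cleanup task messages):

```
[domain/infinera.com]
debug_level = 6
```

Remember to restart sssd after changing the debug level.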
On Tue, Jun 13, 2017 at 12:12:05PM +0000, Joakim Tjernlund wrote:
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
There is a ticket to document this better already, but tl;dr: there is a watchdog that kills the process, because it is presumed stuck, unless an internal event that resets the watchdog is received within three ticks of the 'timeout' value.
What happens when sssd writes this many entries to the cache is that the write operation blocks the event loop, which prevents delivery of the watchdog reset and results in the process being killed.
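The mechanism described above can be sketched generically. This is an illustration of the behaviour, not sssd's actual implementation; all names here are hypothetical:

```python
import time

class Watchdog:
    """Presume the process stuck (and kill it) if no heartbeat
    arrives within max_missed_ticks * timeout seconds.  A blocked
    event loop never delivers the heartbeat, so a long enough
    blocking operation eventually trips the watchdog."""

    def __init__(self, timeout=10, max_missed_ticks=3):
        self.timeout = timeout
        self.max_missed_ticks = max_missed_ticks
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Delivered from a periodic event-loop timer; resets the watchdog.
        self.last_heartbeat = time.monotonic()

    def is_stuck(self, now=None):
        # True once the heartbeat has been missing for three ticks.
        if now is None:
            now = time.monotonic()
        return (now - self.last_heartbeat) > self.timeout * self.max_missed_ticks
```

With the default timeout = 10 the process is presumed stuck after roughly 30 seconds without a heartbeat; raising timeout to 30 extends that window to about 90 seconds, which is what setting timeout = 30 in the domain section effectively did.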
On Tue, 2017-06-13 at 18:01 +0200, Jakub Hrozek wrote:
Hmm, on a tmpfs 3*10 secs should be more than enough for that, I think. Also, the process (the domain process) was never dead but was eating CPU instead.
Jocke
On Tue, Jun 13, 2017 at 06:21:28PM +0000, Joakim Tjernlund wrote:
Well, I was not precise earlier: it doesn't have to be writes. For example, the loop you showed checks whether all members of a group are already cached by searching for each member in turn. That is not a write, but it can also block the process.
On Mon, Jun 12, 2017 at 03:32:22PM +0000, Joakim Tjernlund wrote:
I see A LOT of this in the log (figured I'd look before I restart sssd):
Right, this is sssd looking up members for a group it is processing. It is one of the pieces we need to refactor in the next version, because the sdap_async_groups.c module can end up looking up the same member of the same group several times during a single group-save operation (IIRC; this is from memory from when I was working on perf enhancements in a previous version).
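The refactoring idea can be sketched like this. This is a sketch only, with hypothetical names; the real code lives in sdap_async_groups.c and operates on ldb messages:

```python
def resolve_members_once(member_dns, lookup_in_cache):
    """Resolve each member DN at most once per group-save operation.

    The current code is described as re-searching the cache for the
    same member several times while saving one group; memoizing the
    per-DN result avoids the repeated expensive cache searches."""
    memo = {}
    resolved = []
    for dn in member_dns:
        if dn not in memo:
            memo[dn] = lookup_in_cache(dn)  # the expensive cache search
        resolved.append(memo[dn])
    return resolved
```

For a group whose member list repeats a DN, the cache is searched only once per distinct DN instead of once per occurrence.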