Both 1.15.2 and git master hang after less than 24 hours on a server.
I can see this repeating in the domain log:
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb
(Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb
Ideas?
Jocke
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
This is caused by the write to disk taking too long.
Ideas?
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
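[Editor's note: a sketch of the two workarounds mentioned above. The domain name is taken from this thread; the tmpfs size and mode are assumptions, not recommendations from the list.]

```ini
; sssd.conf: disable enumeration for the domain
[domain/infinera.com]
enumerate = false
```

```
# /etc/fstab: alternatively, keep the sssd cache on tmpfs
tmpfs  /var/lib/sss/db  tmpfs  size=512M,mode=0700  0 0
```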
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
And never will?
Jocke
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
Disable enumeration or move the cache to tmpfs. Enumeration won't work well with large domains, sorry.
And never will?
We are doing incremental performance improvements. There is a round planned for the next upstream version. I'm afraid we don't have any patches yet, but Sumit and I have been throwing ideas around, so we already know what to do. But please keep in mind that enumerating a large forest amounts to keeping a local replica, which is going to be costly.
On Sun, 11 Jun 2017, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
I'm an advocate of this method. With older versions of SSSD, against our relatively large AD, the performance boost from running with tmpfs was immense. This advantage has been reducing over time, as a normally configured SSSD's performance has improved greatly in our configuration.
jh
On Mon, 2017-06-12 at 09:19 +0100, John Hodrien wrote:
On Sun, 11 Jun 2017, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
Testing this now. It is a bit strange that even with enumerate = true, the first time I run getent group it pauses for a little while, even if I wait a few minutes for the cache to populate.
Jocke
On Mon, Jun 12, 2017 at 08:29:29AM +0000, Joakim Tjernlund wrote:
Testing this now. It is a bit strange that even with enumerate = true, the first time I run getent group it pauses for a little while, even if I wait a few minutes for the cache to populate.
sssd blocks on the very first enumeration to avoid replying with no or partial results.
On Mon, 2017-06-12 at 10:29 +0200, Joakim Tjernlund wrote:
hmm, aren't "offline" login creds stored here as well? Then a RAM fs will lose the offline creds on each reboot. Is there a way around this?
Jocke
On Mon, 12 Jun 2017, Joakim Tjernlund wrote:
hmm, aren't "offline" login creds stored here as well? Then a RAM fs will lose the offline creds on each reboot. Is there a way around this?
You could sync it elsewhere on shutdown perhaps?
So far we've got away with not using tmpfs on machines that need stored credentials.
jh
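[Editor's note: a minimal sketch of the sync-elsewhere idea, assuming the cache lives on tmpfs. CACHE_DIR, BACKUP_DIR, and the save/restore hook points are assumptions; credentials cached after the last save would still be lost on a crash.]

```shell
#!/bin/sh
# Sketch: persist a tmpfs-backed sssd cache across reboots by copying it
# to disk at shutdown and restoring it at boot (hypothetical paths).
CACHE_DIR=${CACHE_DIR:-/var/lib/sss/db}
BACKUP_DIR=${BACKUP_DIR:-/var/lib/sss/db.persist}

save_cache() {
    # run from a shutdown hook, after sssd has stopped
    mkdir -p "$BACKUP_DIR"
    cp -a "$CACHE_DIR"/. "$BACKUP_DIR"/
}

restore_cache() {
    # run from a boot hook, after the tmpfs is mounted, before sssd starts
    [ -d "$BACKUP_DIR" ] && cp -a "$BACKUP_DIR"/. "$CACHE_DIR"/
}

case "${1:-}" in
    save)    save_cache ;;
    restore) restore_cache ;;
esac
```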
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
Oh, sure. The other alternative might be to mount the cache to tmpfs.
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to be an additional problem here.
I don't think this AD is that big either:

# > getent passwd | wc -l
3236
# > getent group | wc -l
885
Any ideas?
Jocke
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
Can you get a pstack of the process when it is 'stuck'?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
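[Editor's note: the suggestion corresponds to a domain-section option along these lines. My understanding is that `timeout` here is sssd's internal heartbeat timeout, default 10 seconds; the domain name is taken from this thread.]

```ini
; sssd.conf sketch: raise the per-domain heartbeat timeout
[domain/infinera.com]
enumerate = true
timeout = 30
```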
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
both 1.15.2 and git master hangs after less than 24 hour on a server.
I can see this repeating the domain log:
(Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb
This is caused by too long write to disk.
Can I just increase the timeout for now? I will patch the code if needed. On this sever we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to an additional problem here.
I don't this AD is that big either: # > getent passwd | wc -l 3236 # > getent group | wc -l 885
Any ideas?
Can you get a pstack of the process when it is 'stuck'?
Don't know what pstack is.
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
will try ..
On Mon, Jun 12, 2017 at 03:21:43PM +0000, Joakim Tjernlund wrote:
Don't know what pstack is.
Sorry, it's a utility that prints the backtrace of a process, e.g.:

pstack $(pidof sssd_be)
#0  0x00007f5fa5ae9db3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f5fa61ca8ca in epoll_event_loop (tvalp=0x7ffd78977bf0, epoll_ev=0xb44e70) at ../tevent_epoll.c:642
#2  epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../tevent_epoll.c:926
#3  0x00007f5fa61c8f0a in std_event_loop_once (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent_standard.c:114
#4  0x00007f5fa61c50e0 in _tevent_loop_once (ev=ev@entry=0xb44c30, location=location@entry=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent.c:533
#5  0x00007f5fa61c527b in tevent_common_loop_wait (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent.c:637
#6  0x00007f5fa61c8e9a in std_event_loop_wait (ev=0xb44c30, location=0x7f5faa19cbbd "/sssd/src/util/server.c:719") at ../tevent_standard.c:140
#7  0x00007f5faa173f10 in server_loop (main_ctx=0xb46080) at /sssd/src/util/server.c:719
#8  0x00000000004093ff in main (argc=8, argv=0x7ffd78978028) at /sssd/src/providers/data_provider_be.c:589
I don't know about Gentoo, but on RHEL/Fedora, it's part of the gdb package.
On Mon, 2017-06-12 at 17:57 +0200, Jakub Hrozek wrote:
I don't know about Gentoo, but on RHEL/Fedora, it's part of the gdb package.
I see, it's not in native Gentoo but can be found in external overlays. Not sure this will help though, as sssd is burning CPU when it gets into this state.
Jocke
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
Can you get a pstack of when the process is 'stuck' ?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
I see a LOT of this in the log (figured I'd look before restarting sssd):
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [sdap_find_entry_by_origDN] (0x4000): Searching cache for [CN=Jovy\20Sena,OU=Sunnyvale,OU=CorpUsers,DC=infinera,DC=com].
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x4c28c00
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x4c28cc0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x4c28cc0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x34ccf50
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x34cd0c0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x34ccf50 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x34cd0c0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x34ccf50 "ltdb_callback"
On Mon, 2017-06-12 at 17:32 +0200, Joakim Tjernlund wrote:
After just adding timeout = 30 and restarting sssd it still hung. Had to clear out the sssd cache as well (saved a copy first) to restore normal function.
Jocke
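[Editor's note: the save-a-copy-then-clear step can be sketched like this. DB_DIR and the backup naming are assumptions; stop sssd before running it, and note that clearing the cache also discards cached offline credentials.]

```shell
#!/bin/sh
# Sketch: back up, then clear, the sssd cache so sssd can rebuild it.
DB_DIR=${DB_DIR:-/var/lib/sss/db}

backup_and_clear() {
    stamp=$(date +%Y%m%d%H%M%S)
    cp -a "$DB_DIR" "$DB_DIR.bak-$stamp"   # keep a copy for later inspection
    rm -f "$DB_DIR"/cache_*.ldb            # sssd recreates these on startup
}
```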
On Mon, Jun 12, 2017 at 03:38:28PM +0000, Joakim Tjernlund wrote:
On Mon, 2017-06-12 at 17:32 +0200, Joakim Tjernlund wrote:
On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote: > On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote: > > both 1.15.2 and git master hangs after less than 24 hour on > > a server. > > > > I can see this repeating the domain log: > > > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb > > This is caused by too long write to disk. >
Can I just increase the timeout for now? I will patch the code if needed. On this server we need enumerate = true ATM, cannot just turn it off.
Oh, sure. The other alternative might be to mount the cache to tmpfs.
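For reference, mounting the cache directory on tmpfs could look roughly like this. The path is the one from this thread; the size and mode values are arbitrary examples, not recommendations from the thread:

```
# one-off mount (as root)
mount -t tmpfs -o size=256m,mode=0700 tmpfs /var/lib/sss/db

# or persistently, via an /etc/fstab entry
tmpfs  /var/lib/sss/db  tmpfs  size=256m,mode=0700  0  0
```

Note that a tmpfs cache is lost on reboot, so sssd has to repopulate it from the server on the next start.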
After mounting a tmpfs this morning on /var/lib/sss/db, the error has returned. Seems to be an additional problem here.
I don't think this AD is that big either:
# getent passwd | wc -l
3236
# getent group | wc -l
885
Any ideas?
Can you get a pstack of when the process is 'stuck' ?
Does increasing the 'timeout' parameter from its default '10' to maybe 30 in the domain section help?
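The suggested change, as a minimal sssd.conf sketch (the domain name is taken from the logs in this thread):

```
[domain/infinera.com]
timeout = 30
```

The option must go in the [domain/...] section, not in [sssd] or elsewhere.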
I see A LOT of this in the log (figured I'd look before I restart sssd):
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [sdap_find_entry_by_origDN] (0x4000): Searching cache for [CN=Jovy\20Sena,OU=Sunnyvale,OU=CorpUsers,DC=infinera,DC=com].
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x4c28c00
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x4c28cc0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x4c28cc0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x4c28c00 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x34ccf50
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x34cd0c0
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x34ccf50 "ltdb_callback"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x34cd0c0 "ltdb_timeout"
(Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x34ccf50 "ltdb_callback"
After just adding timout = 30 and restarting sssd it still hung. Had to clear out(saved a copy first)
^^^^^^^^^^^
There is a typo here; I wonder if you used the correct spelling in the config? Also, did you add the option to the domain section?
the sssd cache as well for normal function.
Jocke
_______________________________________________
sssd-users mailing list -- sssd-users@lists.fedorahosted.org
To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
On Mon, 2017-06-12 at 17:51 +0200, Jakub Hrozek wrote:
There is a typo here, I wonder if you used the correct spelling in the config? Also, did you add the option to the domain section?
It is now :) was in the wrong section before
On Mon, 2017-06-12 at 18:06 +0200, Joakim Tjernlund wrote:
It is now :) was in the wrong section before
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
Jocke
On Tue, 2017-06-13 at 14:12 +0200, Joakim Tjernlund wrote:
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
However, now I see that getent group / getent group <a-grp-name> is incomplete: members are missing. And it varies between machines; even ones that have enumerate = false have an incomplete member list for a random group name.
Jocke
On Tue, Jun 13, 2017 at 12:34:41PM +0000, Joakim Tjernlund wrote:
Bug whack-a-mole, probably: https://pagure.io/SSSD/sssd/issue/3369. Please check the debug logs for messages from the "cleanup task".
On Tue, 2017-06-13 at 17:59 +0200, Jakub Hrozek wrote:
Nothing in the logs, what debug level do I need to see this?
On Tue, Jun 13, 2017 at 06:18:24PM +0000, Joakim Tjernlund wrote:
Nothing in the logs, what debug level do I need to see this?
5 or higher.
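For illustration, the corresponding sssd.conf change might be sketched like this (domain name taken from this thread; any level of 5 or higher should show the cleanup task messages):

```
[domain/infinera.com]
debug_level = 6
```

Remember to restart sssd after changing the debug level.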
On Tue, Jun 13, 2017 at 12:12:05PM +0000, Joakim Tjernlund wrote:
timeout = 30 in domain section SEEMS to help, no problem since yesterday. What did I really do here?
There is a ticket to document this better already, but tl;dr: there is a watchdog that kills the process, because it is presumed stuck, unless an internal event that resets the watchdog is received within three ticks of the 'timeout' value.
What happens when sssd writes this many entries to the cache is that the write operation blocks the event loop, which prevents delivery of the watchdog reset and results in the process being killed.
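The mechanism described above can be sketched generically. This is an illustration of the behaviour, not sssd's actual implementation; all names here are hypothetical:

```python
import time

class Watchdog:
    """Presume the process stuck (and kill it) if no heartbeat
    arrives within max_missed_ticks * timeout seconds.  A blocked
    event loop never delivers the heartbeat, so a long enough
    blocking operation eventually trips the watchdog."""

    def __init__(self, timeout=10, max_missed_ticks=3):
        self.timeout = timeout
        self.max_missed_ticks = max_missed_ticks
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Delivered from a periodic event-loop timer; resets the watchdog.
        self.last_heartbeat = time.monotonic()

    def is_stuck(self, now=None):
        # True once the heartbeat has been missing for three ticks.
        if now is None:
            now = time.monotonic()
        return (now - self.last_heartbeat) > self.timeout * self.max_missed_ticks
```

With the default timeout = 10 the process is presumed stuck after roughly 30 seconds without a heartbeat; raising timeout to 30 extends that window to about 90 seconds, which is what setting timeout = 30 in the domain section effectively did.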
On Tue, 2017-06-13 at 18:01 +0200, Jakub Hrozek wrote:
Hmm, on a tmpfs 3*10 secs should be more than enough for that, I think. Also, the process (the domain process) was never dead but was eating CPU instead.
Jocke
On Tue, Jun 13, 2017 at 06:21:28PM +0000, Joakim Tjernlund wrote:
Well, I was not precise earlier: it doesn't have to be writes. For example, the loop you showed checks whether all members of a group are already cached by searching for each member in turn. That is not a write, but it can also block the process.
On Mon, Jun 12, 2017 at 03:32:22PM +0000, Joakim Tjernlund wrote:
I see A LOT of this in the log (figured I'd look before I restart sssd):
Right, this is sssd looking up members for a group it is processing. It is one of the pieces we need to refactor in the next version, because the sdap_async_groups.c module can end up looking up the same member of the same group several times during a single group-save operation (IIRC; this is from memory from when I was working on perf enhancements in a previous version).
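The refactoring idea can be sketched like this. This is a sketch only, with hypothetical names; the real code lives in sdap_async_groups.c and operates on ldb messages:

```python
def resolve_members_once(member_dns, lookup_in_cache):
    """Resolve each member DN at most once per group-save operation.

    The current code is described as re-searching the cache for the
    same member several times while saving one group; memoizing the
    per-DN result avoids the repeated expensive cache searches."""
    memo = {}
    resolved = []
    for dn in member_dns:
        if dn not in memo:
            memo[dn] = lookup_in_cache(dn)  # the expensive cache search
        resolved.append(memo[dn])
    return resolved
```

For a group whose member list repeats a DN, the cache is searched only once per distinct DN instead of once per occurrence.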