Hello everone,
we have an issue/problem that seems highly related to the current "timeout and offline mode behaviour".
We have several SSSD 1.9.5 systems with "cache_credentials = true", which experience periodic blocking/pausing of the complete X session for several seconds (~5-10) when these machines are not connected to their home network. The user is not able to use the machine at all during these lockups, which is very annoying...
These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. 2) The LDAP server is behind a firewall, dropping packets from outside networks. 3) The lockups seem to occur if the clients are not connected to this "home" network but to a different network with similar configuration (192.168.x.x, which is pretty common).
I am still trying to track down the issue with "debug_level = 9" and get the logs of such a lockup.
@Jakub: Could you let me know which openLDAP version you are using and suspecting to be the problem? Could you let me know how to trigger/test the issue reliably (the DNS setup you are using to trigger it)?
I would like to confirm that we are having the same or possibly different problem...
Best regards, Joschi Brauchle
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 06/02/2014 04:34 AM, Joschi Brauchle wrote:
Hello everone,
we have an issue/problem that seems highly related to the current "timeout and offline mode behaviour".
We have several SSSD 1.9.5 systems with "cache_credentials = true", which experience periodic blocking/pausing of the complete X session for several seconds (~5-10) when these machines are not connected to their home network. The user is not able to use the machine at all during these lockups, which is very annoying...
These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. 2) The LDAP server is behind a firewall, dropping packets from outside networks.
This is the real problem. If SSSD can route to the IP address, then we have to proceed assuming that the LDAP server should be available (thereby attempting to connect to it and perform online authentication). There's really no way to determine ahead of time whether the service is "supposed" to be available.
You may want to play with the option 'ldap_opt_timeout' (see sssd-ldap(5)). It controls how long the OpenLDAP client libraries will wait for a response (in your case, how long it will wait while the packets are dropped. It defaults to 6s).
- The lockups seem to occur if the clients are not connected to
this "home" network but to a different network with similar configuration (192.168.x.x, which is pretty common).
I am still trying to track down the issue with "debug_level = 9" and get the logs of such a lockup.
@Jakub: Could you let me know which openLDAP version you are using and suspecting to be the problem? Could you let me know how to trigger/test the issue reliably (the DNS setup you are using to trigger it)?
I would like to confirm that we are having the same or possibly different problem...
Best regards, Joschi Brauchle
_______________________________________________ sssd-users mailing list sssd-users@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/sssd-users
On Mon, 2 Jun 2014, Stephen Gallagher wrote:
This is the real problem. If SSSD can route to the IP address, then we have to proceed assuming that the LDAP server should be available (thereby attempting to connect to it and perform online authentication). There's really no way to determine ahead of time whether the service is "supposed" to be available.
You may want to play with the option 'ldap_opt_timeout' (see sssd-ldap(5)). It controls how long the OpenLDAP client libraries will wait for a response (in your case, how long it will wait while the packets are dropped. It defaults to 6s).
This should be a one off hit though, right? If I discover the LDAP server is offline, I should remember this, admittedly recheck periodically, but never cause another delay waiting for it to spring back into life. Given the way some of these laptops are used, I'd even quite like to configure it to default to this state.
When I last tried this (which was a while ago) these delays would happen repeatedly, so the setup was unusable, and I had to ditch sssd on the laptop.
jh
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 06/02/2014 07:51 AM, John Hodrien wrote:
On Mon, 2 Jun 2014, Stephen Gallagher wrote:
This is the real problem. If SSSD can route to the IP address, then we have to proceed assuming that the LDAP server should be available (thereby attempting to connect to it and perform online authentication). There's really no way to determine ahead of time whether the service is "supposed" to be available.
You may want to play with the option 'ldap_opt_timeout' (see sssd-ldap(5)). It controls how long the OpenLDAP client libraries will wait for a response (in your case, how long it will wait while the packets are dropped. It defaults to 6s).
This should be a one off hit though, right? If I discover the LDAP server is offline, I should remember this, admittedly recheck periodically, but never cause another delay waiting for it to spring back into life. Given the way some of these laptops are used, I'd even quite like to configure it to default to this state.
When I last tried this (which was a while ago) these delays would happen repeatedly, so the setup was unusable, and I had to ditch sssd on the laptop.
Well, in most common cases, the LDAP server is unresolvable when not on the VPN/inside the network, so SSSD immediately detects that it can't get there and the delay is unnoticeable.
It's those cases where the server is addressable but unresponsive that is much harder to handle.
Right now, we have a two-minute sleep between operations trying to go online again. (I think I saw a patch go in for 1.12 that makes this configurable). That's mostly so that we catch cases where you've connected to the VPN but for one reason or another SSSD doesn't get notified that the network state changed (there are lots of edge-cases that cause this).
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote:
Hello everone,
we have an issue/problem that seems highly related to the current "timeout and offline mode behaviour".
We have several SSSD 1.9.5 systems with "cache_credentials = true", which experience periodic blocking/pausing of the complete X session for several seconds (~5-10) when these machines are not connected to their home network. The user is not able to use the machine at all during these lockups, which is very annoying...
These are some of the details:
- The LDAP server has a local 192.168.x.x IP, which is resolvable globally.
IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks.
- The LDAP server is behind a firewall, dropping packets from outside
networks.
Tell your firewall admins to return a ICMP Reject, not just drop packets on the floor. There is only so much we can do with badly configured firewalls (this will help only on your own network) or networks.
- The lockups seem to occur if the clients are not connected to this
"home" network but to a different network with similar configuration (192.168.x.x, which is pretty common).
On other networks there isn't much you can do, except to avoid returning DNS replies at all, this way the client does not even attempt to contact a random 'private' IP address on a different network.
Of course if you have host names in /etc/hosts or use directly ip addresses in sssd.conf configuration there isn't much you can do, again.
At most you can play with timeouts to reduce the issues.
Simo.
On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote:
These are some of the details:
- The LDAP server has a local 192.168.x.x IP, which is resolvable globally.
IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks.
What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
So I finally have a "debug_level = 9" log output from a lockup: ---------------------------- (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): beginning to connect (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x0100): Reseting the status of port 636 for server 'gemini.lnt.ei.tum.de' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Saving the first resolved server (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x0200): Found address for server gemini.lnt.ei.tum.de: [192.168.109.3] TTL 7200 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x4000): Using file descriptor [21] for LDAP connection. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][name=brauchle] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18BFD50 (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [ping] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_async_sys_connect_timeout] (0x0100): The LDAP connection timed out (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_handle_release] (0x2000): Trace: sh[0x191fdb0], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'gemini.lnt.ei.tum.de' as 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error]) (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline! (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #2 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #3 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_release_conn_data] (0x4000): releasing unused connection ----------------------------
As the logs are showing, the system was completely unresponsive for about 5 seconds form 08:16:42 to 08:16:47 when the LDAP connection timed out.
I am currently NOT connected to my home '192.168.109.0/24' network, but this is my current network config: ---------------------------- # ifconfig wlan0 Link encap:Ethernet HWaddr ... inet addr:192.168.1.11 Bcast:192.168.1.255 Mask:255.255.255.0 # route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default SL2141I.home 0.0.0.0 UG 0 0 0 wlan0 192.168.1.0 * 255.255.255.0 U 9 0 0 wlan0 ----------------------------
Pinging the LDAP server 'gemini.lnt.ei.tum.de' results in: ---------------------------- # ping gemini.lnt.ei.tum.de PING gemini.lnt.ei.tum.de (192.168.109.3) 56(84) bytes of data. From 62.53.2.254 icmp_seq=1 Destination Net Unreachable ----------------------------
So why does the LDAP connection not terminate immediately with the same 'destination net unreachable' status but time out instead?
Best regards, Joschi Brauchle
On 3 June 2014 06:25, Joschi Brauchle joschi.brauchle@tum.de wrote:
On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote:
These are some of the details:
- The LDAP server has a local 192.168.x.x IP, which is resolvable
globally.
IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks.
What I meant is that the DNS hostname of the LDAP server ' gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
And neither can corporate firewalls be easily changed to return failures instead of dropping packets. (our IPs are non-rfc1918) I know it's well meant advice from Simo, but reality disagrees with the implementation. Also if I read Joschi's further part of the mail correctly it might even happen if the IP stack thinks it's unroutable.
I agree it would be nice and preferable to just adjust the environment to fit this, but many users might not be at liberty to do so. I have a script that adds blackholes to the routing table just to work around this brokenness if it detects a non-corporate network. I'd rather not roll that to the corporate landscape.
Cheers
Thomas
On Tue, 2014-06-03 at 06:36 +0000, Thomas Rücker wrote:
On 3 June 2014 06:25, Joschi Brauchle joschi.brauchle@tum.de wrote: On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote: These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks. What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
And neither can corporate firewalls be easily changed to return failures instead of dropping packets. (our IPs are non-rfc1918) I know it's well meant advice from Simo, but reality disagrees with the implementation. Also if I read Joschi's further part of the mail correctly it might even happen if the IP stack thinks it's unroutable.
Understandable, but there are some conditions SSSD simply cannot know about w/o outside knowledge.
I agree it would be nice and preferable to just adjust the environment to fit this, but many users might not be at liberty to do so.
True, but then they need to be able to do something else at the system level.
I have a script that adds blackholes to the routing table just to work around this brokenness if it detects a non-corporate network. I'd rather not roll that to the corporate landscape.
Well, one way could be to use a DNS name that simply does not resolve outside your network, maybe that is also not feasible, and is subject to caching issues too.
Perhaps we could provide a way yo tell sssd to go permanently offline so you can build a script to turn this on/off based on your external knowledge of the network environment, would that work ?
Simo.
On Tue, Jun 03, 2014 at 08:40:15AM -0400, Simo Sorce wrote:
On Tue, 2014-06-03 at 06:36 +0000, Thomas Rücker wrote:
On 3 June 2014 06:25, Joschi Brauchle joschi.brauchle@tum.de wrote: On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote: These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks. What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
And neither can corporate firewalls be easily changed to return failures instead of dropping packets. (our IPs are non-rfc1918) I know it's well meant advice from Simo, but reality disagrees with the implementation. Also if I read Joschi's further part of the mail correctly it might even happen if the IP stack thinks it's unroutable.
Understandable, but there are some conditions SSSD simply cannot know about w/o outside knowledge.
I agree it would be nice and preferable to just adjust the environment to fit this, but many users might not be at liberty to do so.
True, but then they need to be able to do something else at the system level.
I have a script that adds blackholes to the routing table just to work around this brokenness if it detects a non-corporate network. I'd rather not roll that to the corporate landscape.
Well, one way could be to use a DNS name that simply does not resolve outside your network, maybe that is also not feasible, and is subject to caching issues too.
Perhaps we could provide a way yo tell sssd to go permanently offline so you can build a script to turn this on/off based on your external knowledge of the network environment, would that work ?
That's an interesting thought. Do you mean having a mode (using an option) where an offline failure would be permanent /except/ for receiving signals?
Then the user could drop NM dispatcher scripts that would SIGUSR2 SSSD on joining the right network.
On 04/06/14 10:18, Jakub Hrozek wrote:
On Tue, Jun 03, 2014 at 08:40:15AM -0400, Simo Sorce wrote:
On Tue, 2014-06-03 at 06:36 +0000, Thomas Rücker wrote:
On 3 June 2014 06:25, Joschi Brauchle joschi.brauchle@tum.de wrote: On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote: These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks. What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
And neither can corporate firewalls be easily changed to return failures instead of dropping packets. (our IPs are non-rfc1918) I know it's well meant advice from Simo, but reality disagrees with the implementation. Also if I read Joschi's further part of the mail correctly it might even happen if the IP stack thinks it's unroutable.
Understandable, but there are some conditions SSSD simply cannot know about w/o outside knowledge.
I agree it would be nice and preferable to just adjust the environment to fit this, but many users might not be at liberty to do so.
True, but then they need to be able to do something else at the system level.
I have a script that adds blackholes to the routing table just to work around this brokenness if it detects a non-corporate network. I'd rather not roll that to the corporate landscape.
Well, one way could be to use a DNS name that simply does not resolve outside your network, maybe that is also not feasible, and is subject to caching issues too.
Perhaps we could provide a way yo tell sssd to go permanently offline so you can build a script to turn this on/off based on your external knowledge of the network environment, would that work ?
That's an interesting thought. Do you mean having a mode (using an option) where an offline failure would be permanent /except/ for receiving signals?
Then the user could drop NM dispatcher scripts that would SIGUSR2 SSSD on joining the right network.
That would probably fit our setup very well. We started exploring NM-dispatcher scripts for our "are we inside" state checking anyway and have something quite reliable now. We need that state anyway in several places, so SSSD using the same state would be good and consistent from our point of view. This would allow us to replace "ip route add blackhole 203.0.113.42" with something sensible that tells sssd that it should be offline or online.
I realize that this is just one way to skin this cat, for us it's quite appealing though.
Cheers
Thomas
On Wed, Jun 04, 2014 at 11:18:52AM +0300, "Thomas B. Rücker" wrote:
On 04/06/14 10:18, Jakub Hrozek wrote:
On Tue, Jun 03, 2014 at 08:40:15AM -0400, Simo Sorce wrote:
On Tue, 2014-06-03 at 06:36 +0000, Thomas Rücker wrote:
On 3 June 2014 06:25, Joschi Brauchle joschi.brauchle@tum.de wrote: On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote: These are some of the details: 1) The LDAP server has a local 192.168.x.x IP, which is resolvable globally. IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks. What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
And neither can corporate firewalls be easily changed to return failures instead of dropping packets. (our IPs are non-rfc1918) I know it's well meant advice from Simo, but reality disagrees with the implementation. Also if I read Joschi's further part of the mail correctly it might even happen if the IP stack thinks it's unroutable.
Understandable, but there are some conditions SSSD simply cannot know about w/o outside knowledge.
I agree it would be nice and preferable to just adjust the environment to fit this, but many users might not be at liberty to do so.
True, but then they need to be able to do something else at the system level.
I have a script that adds blackholes to the routing table just to work around this brokenness if it detects a non-corporate network. I'd rather not roll that to the corporate landscape.
Well, one way could be to use a DNS name that simply does not resolve outside your network, maybe that is also not feasible, and is subject to caching issues too.
Perhaps we could provide a way yo tell sssd to go permanently offline so you can build a script to turn this on/off based on your external knowledge of the network environment, would that work ?
That's an interesting thought. Do you mean having a mode (using an option) where an offline failure would be permanent /except/ for receiving signals?
Then the user could drop NM dispatcher scripts that would SIGUSR2 SSSD on joining the right network.
That would probably fit our setup very well. We started exploring NM-dispatcher scripts for our "are we inside" state checking anyway and have something quite reliable now. We need that state anyway in several places, so SSSD using the same state would be good and consistent from our point of view. This would allow us to replace "ip route add blackhole 203.0.113.42" with something sensible that tells sssd that it should be offline or online.
I realize that this is just one way to skin this cat, for us it's quite appealing though.
Cheers
Thomas
Can you file a ticket in that case, please?
On 06/03/2014 08:25 AM, Joschi Brauchle wrote:
On 06/02/2014 05:39 PM, Simo Sorce wrote:
On Mon, 2014-06-02 at 10:34 +0200, Joschi Brauchle wrote:
These are some of the details:
- The LDAP server has a local 192.168.x.x IP, which is resolvable
globally.
IPs in the 129.168 are not resolvable globally, they are private addresses and routing is not allowed outside private networks.
What I meant is that the DNS hostname of the LDAP server 'gemini.lnt.ei.tum.de' is globally resolvable to its private '192.168.109.3' IP address. Unfortunately, this DNS setup cannot be changed..
So I finally have a "debug_level = 9" log output from a lockup:
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): beginning to connect (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x0100): Reseting the status of port 636 for server 'gemini.lnt.ei.tum.de' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Saving the first resolved server (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x0200): Found address for server gemini.lnt.ei.tum.de: [192.168.109.3] TTL 7200 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x4000): Using file descriptor [21] for LDAP connection. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][name=brauchle] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18BFD50 (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [ping] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_async_sys_connect_timeout] (0x0100): The LDAP connection timed out (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_handle_release] (0x2000): Trace: sh[0x191fdb0], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'gemini.lnt.ei.tum.de' as 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error]) (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline! (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #2 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #3 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_release_conn_data] (0x4000): releasing unused connection
As the logs are showing, the system was completely unresponsive for about 5 seconds form 08:16:42 to 08:16:47 when the LDAP connection timed out.
I am currently NOT connected to my home '192.168.109.0/24' network, but this is my current network config:
# ifconfig wlan0 Link encap:Ethernet HWaddr ... inet addr:192.168.1.11 Bcast:192.168.1.255 Mask:255.255.255.0 # route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default SL2141I.home 0.0.0.0 UG 0 0 0 wlan0 192.168.1.0 * 255.255.255.0 U 9 0 0 wlan0
Pinging the LDAP server 'gemini.lnt.ei.tum.de' results in:
# ping gemini.lnt.ei.tum.de PING gemini.lnt.ei.tum.de (192.168.109.3) 56(84) bytes of data. From 62.53.2.254 icmp_seq=1 Destination Net Unreachable
So why does the LDAP connection not terminate immediately with the same 'destination net unreachable' status but time out instead?
Could anyone comment if this is correct behavior or a bug?
It seems strange to me that the IP stack immediately says the LDAP server is unreachable whereas SSSD/openLDAP times out while trying to reach it...
Is this possibly the same problem as discussed in the earlier part of this thread with openLDAP?
On 06/04/2014 09:41 AM, Joschi Brauchle wrote:
On 06/03/2014 08:25 AM, Joschi Brauchle wrote:
On 06/02/2014 05:39 PM, Simo Sorce wrote:
So I finally have a "debug_level = 9" log output from a lockup:
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): beginning to connect (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_port_status] (0x0100): Reseting the status of port 636 for server 'gemini.lnt.ei.tum.de' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x1000): Saving the first resolved server (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_resolve_server_process] (0x0200): Found address for server gemini.lnt.ei.tum.de: [192.168.109.3] TTL 7200 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x4000): Using file descriptor [21] for LDAP connection. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][name=brauchle] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18E8420 (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [getAccountInfo] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): dbus conn: 18BFD50 (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_dispatch] (0x4000): Dispatching. (Tue Jun 3 08:16:47 2014) [sssd[be[default]]] [sbus_message_handler] (0x4000): Received SBUS method [ping] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_async_sys_connect_timeout] (0x0100): The LDAP connection timed out (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_handle_release] (0x2000): Trace: sh[0x191fdb0], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0] (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'gemini.lnt.ei.tum.de' as 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'gemini.lnt.ei.tum.de' is 'name resolved' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'gemini.lnt.ei.tum.de' is 'not working' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP' (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 5 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error]) (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline! (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #2 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #3 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_release_conn_data] (0x4000): releasing unused connection
As the logs are showing, the system was completely unresponsive for about 5 seconds form 08:16:42 to 08:16:47 when the LDAP connection timed out.
I did some further testing ...
As the logs above are showing, the *first* request to SSSD (which is triggering the reconnection trial) gets answered right away in offline mode:
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Request processed. Returned 1,11,Fast reply - offline (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): beginning to connect
...
Only the *subsequent* requests that are received *while* SSSD is trying to (re-)connect to the LDAP server are queued until the connection times out (for at most 6 seconds). These pending requests are causing the system to block and annoy the user:
...
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][name=brauchle] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete
...
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=10011] (Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step] (0x4000): waiting for connection to complete
...
--> this is the time where the system is unresponsive <--
...
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error]) (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline] (0x2000): Going offline! (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks. (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #2 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done] (0x4000): notify offline to op #3 (Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline
So why not keep the "offline" flag set to "true" until the LDAP connection succeeds and only if so, go to online mode?
As the first request (triggering the reconnection) is answered from the cache anyway, I dont see the point why one needs to keep the subsequent ones pending until the connection is established successfully...
On Thu, 2014-06-05 at 11:05 +0200, Joschi Brauchle wrote:
So why not keep the "offline" flag set to "true" until the LDAP connection succeeds and only if so, go to online mode?
As the first request (triggering the reconnection) is answered from the cache anyway, I dont see the point why one needs to keep the subsequent ones pending until the connection is established successfully...
This is a very good point. Can you open a ticket so we can track this ? If you do not have an account let me know, and I'll file it.
Simo.
On 06/05/2014 09:03 PM, Simo Sorce wrote:
On Thu, 2014-06-05 at 11:05 +0200, Joschi Brauchle wrote:
So why not keep the "offline" flag set to "true" until the LDAP connection succeeds and only if so, go to online mode?
As the first request (triggering the reconnection) is answered from the cache anyway, I dont see the point why one needs to keep the subsequent ones pending until the connection is established successfully...
This is a very good point. Can you open a ticket so we can track this ? If you do not have an account let me know, and I'll file it.
Simo.
I created https://fedorahosted.org/sssd/ticket/2355 to track the issue.
Thanks for your help... the SSSD users at our institute will certainly look forward for this to be resolved! :)
sssd-users@lists.fedorahosted.org