[PATCH v2] kdumpctl: time out after 180 seconds when connecting to ssh host


WANG Chao

3 Jun 2014

2:55 a.m.

When the kdump service starts with an ssh host as the dump target, we connect to the ssh host after network-online.target and touch the dump directory to make sure the host is ready to be dumped to.

Chances are that after network-online.target is reached, the particular network resource we are interested in isn't yet ready for connecting to the specified ssh host. If we connect to the ssh host at that point, we fail.

What we should do is wait for the specific network resource rather than depend entirely on network-online.target, but that is relatively complicated to implement. A simple and direct solution is to retry as many times as needed to connect to the configured ssh host. To avoid an infinite loop, however, we time out and fail. I set the timeout to 180 seconds; generally speaking, 180 seconds should be enough for almost any kind of network to be up and ready.

Vivek:
- sleep 2 seconds after each retry.
- output a brief message for each retry/failure.

Signed-off-by: WANG Chao <chaowang@redhat.com>
---
 kdumpctl | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kdumpctl b/kdumpctl
index 9cae0c4..5e4b1b0 100755
--- a/kdumpctl
+++ b/kdumpctl
@@ -381,9 +381,23 @@ function check_ssh_config()
 function check_ssh_target()
 {
 	local _ret
-	ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
-	_ret=$?
+	local _start _delta
+
+	# Timeout out after 180 seconds, hopefully it's enough.
+	_start=$(date +%s)
+	while : ; do
+		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+		_ret=$?
+		_delta=$(($(date +%s) - $_start))
+		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+			break
+		fi
+		echo "ssh to $DUMP_TARGET failed. Will retry after 2 seconds"
+		sleep 2
+	done
+
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
 		return 1
 	fi
--
1.9.3


Vivek Goyal

3 Jun

9:36 a.m.

On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:

...


> -	ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
> -	_ret=$?
> +	local _start _delta
> +
> +	# Timeout out after 180 seconds, hopefully it's enough.
> +	_start=$(date +%s)
> +	while : ; do
> +		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
> +		_ret=$?
> +		_delta=$(($(date +%s) - $_start))
> +		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
> +			break

Hey Chao,

I wanted this 180-second interval to be declared separately instead of being used directly here, e.g.:

SSH_POLLING_TIMEOUT=180

So declare it either in the same file or in kdump-lib.sh, as you see fit.

...

> +		fi
> +		echo "ssh to $DUMP_TARGET failed. Will retry after 2 seconds"
> +		sleep 2
> +	done
> +
>  	if [ $_ret -ne 0 ]; then
> +		echo "ssh failed after multiple tries"
>  		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
>  		return 1
>  	fi
> --
> 1.9.3

WANG Chao

10:16 p.m.

On 06/03/14 at 10:36am, Vivek Goyal wrote:

...

On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:

...
> Hey Chao,
>
> I wanted this 180-second interval to be declared separately instead of being used directly here.
>
> SSH_POLLING_TIMEOUT 180
>
> So either declare it in same file or in kdump-lib.sh as you see fit.

Sure. Having it declared globally is much better.

Thanks for the review.
WANG Chao
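For illustration, the change agreed on here could look roughly like the sketch below. It is not the patch that was posted: SSH_POLLING_TIMEOUT follows the name suggested in the review, while SSH_POLLING_INTERVAL and the poll_until_timeout helper are hypothetical names introduced only so the retry loop can be exercised without a real ssh host.

```shell
#!/bin/bash
# Sketch only. SSH_POLLING_TIMEOUT is the name suggested in review;
# SSH_POLLING_INTERVAL and poll_until_timeout are hypothetical.
SSH_POLLING_TIMEOUT=${SSH_POLLING_TIMEOUT:-180}
SSH_POLLING_INTERVAL=${SSH_POLLING_INTERVAL:-2}

# Run the given command until it succeeds or the timeout expires.
# Returns the exit status of the last attempt.
poll_until_timeout()
{
	local _ret _start _delta

	_start=$(date +%s)
	while : ; do
		if "$@"; then _ret=0; else _ret=$?; fi
		_delta=$(( $(date +%s) - _start ))
		if [[ $_ret -eq 0 || $_delta -gt $SSH_POLLING_TIMEOUT ]]; then
			break
		fi
		echo "attempt failed. Will retry after $SSH_POLLING_INTERVAL seconds" >&2
		sleep "$SSH_POLLING_INTERVAL"
	done
	return $_ret
}

# Same check as in the patch, with the timeout taken from the constant.
check_ssh_target()
{
	if ! poll_until_timeout ssh -q -i "$SSH_KEY_LOCATION" -o BatchMode=yes \
			"$DUMP_TARGET" mkdir -p "$SAVE_PATH"; then
		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run \"kdumpctl propagate\"" >&2
		return 1
	fi
}
```

Keeping the loop in a helper that takes an arbitrary command is a design choice of this sketch, not something the thread asks for; it only makes the timeout logic easy to try out with a stand-in command.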

Vivek Goyal

9:37 a.m.

On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:

...


> -	ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
> -	_ret=$?
> +	local _start _delta
> +
> +	# Timeout out after 180 seconds, hopefully it's enough.
> +	_start=$(date +%s)
> +	while : ; do
> +		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
> +		_ret=$?
> +		_delta=$(($(date +%s) - $_start))
> +		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
> +			break
> +		fi
> +		echo "ssh to $DUMP_TARGET failed. Will retry after 2 seconds"
> +		sleep 2
> +	done
> +
>  	if [ $_ret -ne 0 ]; then
> +		echo "ssh failed after multiple tries"
>  		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2

Hold on. So assuming the network is up but the keys are not propagated or not valid, we will still keep on retrying? That does not sound right.

We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.

Thanks Vivek

WANG Chao

10:13 p.m.

On 06/03/14 at 10:37am, Vivek Goyal wrote:

...

On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:

...

> >  	if [ $_ret -ne 0 ]; then
> > +		echo "ssh failed after multiple tries"
> >  		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
>
> Hold on. So assuming the network is up but the keys are not propagated or not valid, we will still keep on retrying? That does not sound right.
>
> We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.

I'm not sure how we can do this. The return code from ssh is always 255 for any of these failures, i.e. wrong key, no key, or a network issue.
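Since the exit status alone does not separate these cases, one possible direction (not something proposed in this thread) would be to drop -q, which suppresses most diagnostics, and look at what ssh prints on stderr. The sketch below assumes the OpenSSH client wording "Permission denied" and "Host key verification failed"; that wording is not a stable interface and may vary between versions, and classify_ssh_error is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: classify an ssh failure from its stderr text.
# Assumption: the OpenSSH client wording matched below. It is not a
# stable interface and may differ between versions.
classify_ssh_error()
{
	local _msg=$1

	case "$_msg" in
	*"Permission denied"*|*"Host key verification failed"*)
		echo "auth"      # keys or host key problem: retrying will not help
		;;
	*)
		echo "network"   # anything else: possibly transient
		;;
	esac
}
```

A caller could capture the diagnostic with something like `_err=$(ssh -i "$SSH_KEY_LOCATION" -o BatchMode=yes "$DUMP_TARGET" mkdir -p "$SAVE_PATH" 2>&1 >/dev/null)` and retry only when the result is "network". Matching on message text is fragile, which is the main argument against this approach.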

Vivek Goyal

4 Jun

8:57 a.m.

On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:

[..]

> > >  	if [ $_ret -ne 0 ]; then
> > > +		echo "ssh failed after multiple tries"
> > >  		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
> >
> > Hold on. So assuming the network is up but the keys are not propagated or not valid, we will still keep on retrying? That does not sound right.
> >
> > We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.
>
> I'm not sure how we can do this. The return code from ssh is always 255 for any of these failures, i.e. wrong key, no key, or a network issue.

Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?

Thanks Vivek
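A rough sketch of that idea, assuming the iproute2 `ip route get` command and an IP address rather than a host name (a host name in DUMP_TARGET would first have to be resolved, which itself needs a working network). Both function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: find the local interface a destination is routed through.

# Read `ip route get` output on stdin and print the word after "dev".
route_dev_from_output()
{
	awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Hypothetical wrapper: print the interface used to reach an IP address.
# Returns non-zero when no route (and hence no interface) is found.
route_dev_for_addr()
{
	local _dev

	_dev=$(ip route get "$1" 2>/dev/null | route_dev_from_output)
	[ -n "$_dev" ] && echo "$_dev"
}
```

The state of the interface found this way could then be read from `/sys/class/net/<dev>/operstate`. When no route exists yet, the lookup returns nothing, so the caller still has to decide whether that means "wait" or "fail".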

WANG Chao

6 Jun

12:55 a.m.

On 06/04/14 at 09:57am, Vivek Goyal wrote:

...

On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:

[..]

> Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?

When the network isn't ready, we can't really figure out which interface routes to DUMP_TARGET.

There can be situations where the local network is up, but something is wrong with the connection between the remote host and the local system, or the remote host's network is still initializing.

In that case, should we fail right away without trying a few more times? So I'm not keen on stopping the retries just because the local network is up and ssh fails.

I think it's not too bad to fail after 180 seconds. If it's a configuration issue (wrong key, no key, ...), the user can fix it after the first time the kdump service fails. From then on there would be no such issue, and the retries would only be polling for the network connection.

What do you think?

Thanks WANG Chao

> Thanks
> Vivek

Vivek Goyal

1:08 p.m.

On Fri, Jun 06, 2014 at 01:55:09PM +0800, WANG Chao wrote:

...

On 06/04/14 at 09:57am, Vivek Goyal wrote:

...
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:

[..]

> When the network isn't ready, we can't really figure out which interface routes to DUMP_TARGET.
>
> There can be situations where the local network is up, but something is wrong with the connection between the remote host and the local system, or the remote host's network is still initializing.

I think we need to ask the networking folks and also check how apache waits for the interfaces.

> In that case, should we fail right away without trying a few more times? So I'm not keen on stopping the retries just because the local network is up and ssh fails.
>
> I think it's not too bad to fail after 180 seconds. If it's a configuration issue (wrong key, no key, ...), the user can fix it after the first time the kdump service fails. From then on there would be no such issue, and the retries would only be polling for the network connection.

In the simplest form we could probably use something like "ping" and try to ping the target.

But this will have issues if the target is configured not to respond to ping requests.

> What do you think?

I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.

We need to think of something else.

Thanks Vivek
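One variation on the ping idea that does not depend on ICMP replies (and was not proposed in the thread) would be to probe the target's ssh TCP port directly. The sketch below assumes a bash built with /dev/tcp redirection support and the coreutils `timeout` command; tcp_port_open and its default port and wait time are hypothetical.

```shell
#!/bin/bash
# Sketch only: check whether a TCP port accepts connections.
# Assumptions: bash has /dev/tcp support and coreutils `timeout` exists.
# Arguments: host, port (default 22), seconds to wait (default 5).
tcp_port_open()
{
	local _host=$1 _port=${2:-22} _wait=${3:-5}

	# The child bash opens the connection and exits; its status is ours.
	timeout "$_wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$_host" "$_port" 2>/dev/null
}
```

If the port is reachable but the ssh command still fails, the failure is more likely a key or configuration problem than a network one, so retrying could be skipped. This only narrows the cases down; it does not prove the keys are wrong.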

WANG Chao

9 Jun

6:01 a.m.

On 06/06/14 at 02:08pm, Vivek Goyal wrote:

...

On Fri, Jun 06, 2014 at 01:55:09PM +0800, WANG Chao wrote:

...
On 06/04/14 at 09:57am, Vivek Goyal wrote:

...
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:

[..]


> In the simplest form we could probably use something like "ping" and try to ping the target.
>
> But this will have issues if the target is configured not to respond to ping requests.

Yep, that could be the case...

> > What do you think?
>
> I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.

The question is how we can distinguish wrong keys from a network disconnection. The ssh utility always returns 255 on failure.

What's more, a network disconnection can have various causes:
- the local network isn't ready yet (no IP address)
- the host network isn't ready yet
- the network connection somehow fails:
  - a router isn't working at the time
  - packets are lost because the connection isn't stable

I agree that we should treat wrong keys differently from other issues. But the question is how to separate them. Once that is figured out, we can handle this kind of failure differently ...

Thanks WANG Chao

Vivek Goyal

12:07 p.m.

On Mon, Jun 09, 2014 at 07:01:32PM +0800, WANG Chao wrote:

[..]

> > I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.
>
> The question is how we can distinguish wrong keys from a network disconnection. The ssh utility always returns 255 on failure.
>
> What's more, a network disconnection can have various causes:
> - the local network isn't ready yet (no IP address)
> - the host network isn't ready yet
> - the network connection somehow fails:
>   - a router isn't working at the time
>   - packets are lost because the connection isn't stable

I don't think we can take care of issues like a router not working. Packet loss should be taken care of by the TCP/IP protocol.

> I agree that we should treat wrong keys differently from other issues. But the question is how to separate them. Once that is figured out, we can handle this kind of failure differently ...

I would say for the time being let us not do anything. Let us keep looking, and once we have better ideas, we can write a patch. Retrying upon key verification failure is going to create more problems for us than it solves.

Thanks Vivek

WANG Chao

10:14 p.m.

On 06/09/14 at 01:07pm, Vivek Goyal wrote:

...

On Mon, Jun 09, 2014 at 07:01:32PM +0800, WANG Chao wrote:

[..]


> I don't think we can take care of issues like a router not working. Packet loss should be taken care of by the TCP/IP protocol.

You're right. Those are not cases we should take care of.

> > I agree that we should treat wrong keys differently from other issues. But the question is how to separate them. Once that is figured out, we can handle this kind of failure differently ...
>
> I would say for the time being let us not do anything. Let us keep looking, and once we have better ideas, we can write a patch. Retrying upon key verification failure is going to create more problems for us than it solves.

I agree. I'll hold off. If there's real-world demand in the future, we can always revisit this.


kexec@lists.fedoraproject.org
