[PATCH] kdumpctl: time out after 180 seconds when connecting to ssh host


WANG Chao

26 May 2014

2:30 a.m.

When the kdump service starts with an ssh host as the dump target, we connect to the ssh host after network-online.target and create the dump directory to make sure the host is ready to be dumped to.

Chances are that after network-online.target, the particular network resource we are interested in isn't yet ready for connecting to the specified ssh host. If we connect to the ssh host at that point, the connection fails.

What we should do is wait for the specific network resource rather than depend entirely on network-online.target, but that is relatively complicated to implement. A simple and direct solution is to retry connecting to the configured ssh host as many times as needed. However, to avoid an infinite loop, we time out and fail. I set this timeout to 180 seconds; generally speaking, 180 seconds should be enough for almost any kind of network to be up and ready.

Signed-off-by: WANG Chao <chaowang@redhat.com>
---
 kdumpctl | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kdumpctl b/kdumpctl
index 9cae0c4..0bd6021 100755
--- a/kdumpctl
+++ b/kdumpctl
@@ -381,8 +381,19 @@ function check_ssh_config()
 function check_ssh_target()
 {
 	local _ret
-	ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
-	_ret=$?
+	local _start _delta
+
+	# Timeout out after 180 seconds, hopefully it's enough.
+	_start=$(date +%s)
+	while : ; do
+		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+		_ret=$?
+		_delta=$(($(date +%s) - $_start))
+		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+			break
+		fi
+	done
+
 	if [ $_ret -ne 0 ]; then
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
 		return 1
-- 
1.9.3


Vivek Goyal

28 May

9:37 a.m.

On Mon, May 26, 2014 at 03:30:42PM +0800, WANG Chao wrote:

...


Hi Chao,

A few comments.

- I think we should sleep for a while before we retry ssh. Say, sleep for 2 seconds.

- I think we need to give a brief message about retrying as well as about giving up. Something like:

"ssh to $target failed. Will retry after 2 seconds"

"ssh to $target failed after multiple tries."

- We need to define the timeout of 180 seconds in kdump-lib.sh and use that everywhere.

- We have ssh operations in dracut-kdump.sh. So this retry logic should apply everywhere and not just in kdumpctl, shouldn't it? Won't the same issue arise in the second kernel context if the network is not up?

Thanks Vivek
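
The review comments above (a pause between attempts, a short message on each retry and on giving up, and a named timeout instead of a literal 180) could be combined roughly as in the following sketch. It is only an illustration under stated assumptions and is not the updated patch: the names SSH_RETRY_TIMEOUT, SSH_RETRY_INTERVAL, try_mkdir_on_target and retry_ssh_mkdir are hypothetical and do not come from kdumpctl or kdump-lib.sh. The ssh invocation itself is the one from the patch.

```shell
#!/bin/bash
# Sketch only. SSH_RETRY_TIMEOUT, SSH_RETRY_INTERVAL, try_mkdir_on_target
# and retry_ssh_mkdir are hypothetical names, not taken from kdumpctl.
SSH_RETRY_TIMEOUT=${SSH_RETRY_TIMEOUT:-180}   # give up after this many seconds
SSH_RETRY_INTERVAL=${SSH_RETRY_INTERVAL:-2}   # pause between attempts, in seconds

# Wraps the ssh call from the patch so it can be replaced by a stub
# when experimenting without a real ssh host.
try_mkdir_on_target()
{
	ssh -q -i "$SSH_KEY_LOCATION" -o BatchMode=yes "$DUMP_TARGET" mkdir -p "$SAVE_PATH"
}

# Retries try_mkdir_on_target until it succeeds or the timeout elapses.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry_ssh_mkdir()
{
	local _ret _start _delta

	_start=$(date +%s)
	while : ; do
		try_mkdir_on_target
		_ret=$?
		[ $_ret -eq 0 ] && return 0

		_delta=$(( $(date +%s) - _start ))
		if [ "$_delta" -ge "$SSH_RETRY_TIMEOUT" ]; then
			echo "ssh to $DUMP_TARGET failed after multiple tries." >&2
			return $_ret
		fi

		echo "ssh to $DUMP_TARGET failed. Will retry after $SSH_RETRY_INTERVAL seconds" >&2
		sleep "$SSH_RETRY_INTERVAL"
	done
}
```

In this sketch the timeout is checked before sleeping, so the loop never pauses after the final failed attempt, and both messages go to stderr like the existing "Could not create" error.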


kexec mailing list kexec@lists.fedoraproject.org https://lists.fedoraproject.org/mailman/listinfo/kexec

WANG Chao

30 May

1:30 a.m.

On 05/28/14 at 10:37am, Vivek Goyal wrote:

...

On Mon, May 26, 2014 at 03:30:42PM +0800, WANG Chao wrote:

...
> Hi Chao,
>
> A few comments.
>
> - I think we should sleep for a while before we retry ssh. Say, sleep for 2 seconds.

Will do.

> - I think we need to give a brief message about retrying as well as about giving up. Something like:
>
>   "ssh to $target failed. Will retry after 2 seconds"
>
>   "ssh to $target failed after multiple tries."

I thought this retry behavior would be sealed and invisible to the user. I'm fine with outputting a brief message for the retry.

> - We need to define the timeout of 180 seconds in kdump-lib.sh and use that everywhere.

I can put it in kdump-lib.sh. But I can't find it useful for our other scripts.

> - We have ssh operations in dracut-kdump.sh. So this retry logic should apply everywhere and not just in kdumpctl, shouldn't it? Won't the same issue arise in the second kernel context if the network is not up?

Emm.. In the 2nd kernel, dracut takes care of bringing up the network, not the kdump script. The timeout/retry is determined on the dracut side. I vaguely remember it's 180 seconds too, and this value has proved to work well in the past. So I don't think we would want to touch the 2nd kernel.

Thanks
WANG Chao


Vivek Goyal

9:02 a.m.

On Fri, May 30, 2014 at 02:30:07PM +0800, WANG Chao wrote:

...

On 05/28/14 at 10:37am, Vivek Goyal wrote:

...
On Mon, May 26, 2014 at 03:30:42PM +0800, WANG Chao wrote:

...
> > Hi Chao,
> >
> > A few comments.
> >
> > - I think we should sleep for a while before we retry ssh. Say, sleep for 2 seconds.
>
> Will do.
>
> > - I think we need to give a brief message about retrying as well as about giving up. Something like:
> >
> >   "ssh to $target failed. Will retry after 2 seconds"
> >
> >   "ssh to $target failed after multiple tries."
>
> I thought this retry behavior would be sealed and invisible to the user. I'm fine with outputting a brief message for the retry.

These messages will go into the per-service buffer, and the user will not see them until they run "systemctl status kdump". I am not too particular about it. Sometimes, if we are waiting for too long, I want to output some message; otherwise people tend to think that the service is *hung*.

So one message every 2 or 4 seconds should not be too bad.

> > - We need to define the timeout of 180 seconds in kdump-lib.sh and use that everywhere.
>
> I can put it in kdump-lib.sh. But I can't find it useful for our other scripts.
>
> > - We have ssh operations in dracut-kdump.sh. So this retry logic should apply everywhere and not just in kdumpctl, shouldn't it? Won't the same issue arise in the second kernel context if the network is not up?
>
> Emm.. In the 2nd kernel, dracut takes care of bringing up the network, not the kdump script. The timeout/retry is determined on the dracut side. I vaguely remember it's 180 seconds too, and this value has proved to work well in the past. So I don't think we would want to touch the 2nd kernel.

Ok. So is it the dracut initqueue logic which makes sure the device comes up? I am fine if we don't have this problem in the second kernel.

Given the fact that systemd and dracut dependencies are so complex, it is hard to say for sure whether we are waiting for the device to come up or not.

So if dracut-kdump.sh is not sharing this logic, then I agree we don't have to put it in kdump-lib.sh.

Thanks
Vivek

WANG Chao

3 Jun

2:35 a.m.

On 05/30/14 at 10:02am, Vivek Goyal wrote:

...

On Fri, May 30, 2014 at 02:30:07PM +0800, WANG Chao wrote:

...
On 05/28/14 at 10:37am, Vivek Goyal wrote:

...
On Mon, May 26, 2014 at 03:30:42PM +0800, WANG Chao wrote:

...
> > > Hi Chao,
> > >
> > > A few comments.
> > >
> > > - I think we should sleep for a while before we retry ssh. Say, sleep for 2 seconds.
> >
> > Will do.
> >
> > > - I think we need to give a brief message about retrying as well as about giving up. Something like:
> > >
> > >   "ssh to $target failed. Will retry after 2 seconds"
> > >
> > >   "ssh to $target failed after multiple tries."
> >
> > I thought this retry behavior would be sealed and invisible to the user. I'm fine with outputting a brief message for the retry.
>
> These messages will go into the per-service buffer, and the user will not see them until they run "systemctl status kdump". I am not too particular about it. Sometimes, if we are waiting for too long, I want to output some message; otherwise people tend to think that the service is *hung*.
>
> So one message every 2 or 4 seconds should not be too bad.

Agreed. That sounds fine.

> > > - We need to define the timeout of 180 seconds in kdump-lib.sh and use that everywhere.
> >
> > I can put it in kdump-lib.sh. But I can't find it useful for our other scripts.
> >
> > > - We have ssh operations in dracut-kdump.sh. So this retry logic should apply everywhere and not just in kdumpctl, shouldn't it? Won't the same issue arise in the second kernel context if the network is not up?
> >
> > Emm.. In the 2nd kernel, dracut takes care of bringing up the network, not the kdump script. The timeout/retry is determined on the dracut side. I vaguely remember it's 180 seconds too, and this value has proved to work well in the past. So I don't think we would want to touch the 2nd kernel.
>
> Ok. So is it the dracut initqueue logic which makes sure the device comes up? I am fine if we don't have this problem in the second kernel.
>
> Given the fact that systemd and dracut dependencies are so complex, it is hard to say for sure whether we are waiting for the device to come up or not.
>
> So if dracut-kdump.sh is not sharing this logic, then I agree we don't have to put it in kdump-lib.sh.

OK. I'll update this patch and send it shortly.

Thanks
WANG Chao
