When the kdump service starts with an ssh host as the dump target, we connect to the ssh host after network-online.target and run mkdir -p on the dump directory to make sure the host is ready to be dumped to.
Chances are that after network-online.target, the particular network resource we are interested in isn't yet ready for connecting to the specified ssh host. If we connect to the ssh host at that point, the connection fails.
What we should do is wait for the specific network resource rather than depend entirely on network-online.target, but that is relatively complicated to implement. A simple and direct solution is to retry connecting to the configured ssh host as many times as needed. To avoid an infinite loop, we time out and fail. I set the timeout to 180 seconds; generally speaking, 180 seconds should be enough for almost any kind of network to be up and ready.
Vivek:
- sleep 2 seconds after each retry.
- output a brief message for each retry/failure.

Signed-off-by: WANG Chao <chaowang@redhat.com>
---
 kdumpctl | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kdumpctl b/kdumpctl
index 9cae0c4..5e4b1b0 100755
--- a/kdumpctl
+++ b/kdumpctl
@@ -381,9 +381,23 @@ function check_ssh_config()
 function check_ssh_target()
 {
 	local _ret
-	ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
-	_ret=$?
+	local _start _delta
+
+	# Timeout out after 180 seconds, hopefully it's enough.
+	_start=$(date +%s)
+	while : ; do
+		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+		_ret=$?
+		_delta=$(($(date +%s) - $_start))
+		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+			break
+		fi
+		echo "ssh to $DUMP_TARGET failed. Will retry after 2 seconds"
+		sleep 2
+	done
+
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
 		return 1
 	fi
On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:
[..]
+		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+		_ret=$?
+		_delta=$(($(date +%s) - $_start))
+		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+			break
Hey Chao,
I wanted this "180" second interval to be declared separately instead of being used directly here.
SSH_POLLING_TIMEOUT 180
So either declare it in the same file or in kdump-lib.sh, as you see fit.
+		fi
+		echo "ssh to $DUMP_TARGET failed. Will retry after 2 seconds"
+		sleep 2
+	done
+
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
 		return 1
 	fi
-- 1.9.3
On 06/03/14 at 10:36am, Vivek Goyal wrote:
On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:
[..]
+		ssh -q -i $SSH_KEY_LOCATION -o BatchMode=yes $DUMP_TARGET mkdir -p $SAVE_PATH
+		_ret=$?
+		_delta=$(($(date +%s) - $_start))
+		if [[ $_ret -eq 0 || $_delta -gt 180 ]]; then
+			break
Hey Chao,
I wanted this "180" second interval to be declared separately instead of being used directly here.
SSH_POLLING_TIMEOUT 180
So either declare it in the same file or in kdump-lib.sh, as you see fit.
Sure. Having it declared globally is much better.
Thanks for review
WANG Chao
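For illustration, a minimal sketch of how the retry loop might look with the timeout declared separately, as requested in the review. Only the name SSH_POLLING_TIMEOUT comes from the review comment; SSH_POLLING_INTERVAL and poll_with_timeout are hypothetical names and are not part of the posted patch.

```shell
#!/bin/bash
# Sketch only: SSH_POLLING_INTERVAL and poll_with_timeout are hypothetical
# names; SSH_POLLING_TIMEOUT is the name proposed in the review.
SSH_POLLING_TIMEOUT=180		# give up after this many seconds
SSH_POLLING_INTERVAL=2		# pause between attempts, in seconds

# poll_with_timeout <timeout> <interval> <command...>
# Run the command until it succeeds or more than <timeout> seconds have
# passed. Returns the exit status of the last attempt.
poll_with_timeout()
{
	local _timeout="$1"
	local _interval="$2"
	local _start _delta _ret
	shift 2

	_start=$(date +%s)
	while : ; do
		"$@"
		_ret=$?
		_delta=$(( $(date +%s) - _start ))
		if [ "$_ret" -eq 0 ] || [ "$_delta" -gt "$_timeout" ]; then
			return "$_ret"
		fi
		echo "attempt failed. Will retry after $_interval seconds" >&2
		sleep "$_interval"
	done
}
```

Inside check_ssh_target() the call could then read something like:

	poll_with_timeout "$SSH_POLLING_TIMEOUT" "$SSH_POLLING_INTERVAL" ssh -q -i "$SSH_KEY_LOCATION" -o BatchMode=yes "$DUMP_TARGET" mkdir -p "$SAVE_PATH"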
On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:
[..]
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
Hold on. So assuming the network is up but the keys are not propagated or are not valid, we will still keep on retrying? That does not sound right.
We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.
Thanks
Vivek
On 06/03/14 at 10:37am, Vivek Goyal wrote:
On Tue, Jun 03, 2014 at 03:55:50PM +0800, WANG Chao wrote:
[..]
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
Hold on. So assuming the network is up but the keys are not propagated or are not valid, we will still keep on retrying? That does not sound right.
We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.
I'm not sure how we can do this; the return code from ssh is always 255 for any kind of failure, i.e. wrong key, no key, network issue.
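Since the exit status alone does not tell the failures apart, one conceivable workaround is to look at what ssh prints on stderr. The sketch below is an assumption, not something proposed in this thread: classify_ssh_error is a hypothetical helper, the matched strings are diagnostics commonly printed by OpenSSH clients and may vary between versions and locales, and ssh would have to be run without -q so that the diagnostics are not suppressed.

```shell
#!/bin/bash
# classify_ssh_error <stderr text>
# Hypothetical helper: prints "auth", "network" or "unknown" depending on
# which well-known OpenSSH diagnostic appears in the given text.
classify_ssh_error()
{
	case "$1" in
	*"Permission denied"*|*"Host key verification failed"*)
		# Key or host key problem: retrying will not help.
		echo "auth"
		;;
	*"Connection refused"*|*"Connection timed out"*|*"No route to host"*|*"Network is unreachable"*|*"Could not resolve hostname"*)
		# Connectivity problem: a later retry might succeed.
		echo "network"
		;;
	*)
		echo "unknown"
		;;
	esac
}
```

A caller could capture the diagnostics with something like err=$(ssh ... 2>&1 >/dev/null) and retry only when the result is "network". Matching on message text is fragile, so this is at best a heuristic.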
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:
[..]
 	if [ $_ret -ne 0 ]; then
+		echo "ssh failed after multiple tries"
 		echo "Could not create $DUMP_TARGET:$SAVE_PATH, you probably need to run "kdumpctl propagate"" >&2
Hold on. So assuming the network is up but the keys are not propagated or are not valid, we will still keep on retrying? That does not sound right.
We need to retry only if the network interface is not up. If ssh fails because of missing or wrong keys, then we should not retry.
I'm not sure how we can do this; the return code from ssh is always 255 for any kind of failure, i.e. wrong key, no key, network issue.
Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?
Thanks
Vivek
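A rough sketch of what such a check might look like, assuming the dump host has already been resolved to an IP address and that iproute2's "ip route get" is available. route_dev_from_output and iface_is_up are hypothetical helper names. Note that "ip route get" can only answer once a route to the address exists.

```shell
#!/bin/bash
# route_dev_from_output <text>
# Hypothetical helper: print the interface named after the word "dev" in the
# output of `ip route get <address>`; return 1 if no interface is named.
route_dev_from_output()
{
	local _prev=""
	local _word
	for _word in $1; do	# word splitting is intentional here
		if [ "$_prev" = "dev" ]; then
			echo "$_word"
			return 0
		fi
		_prev="$_word"
	done
	return 1
}

# iface_is_up <dev>
# True if the kernel reports operstate "up" for the interface. Some
# interfaces (loopback, for example) report "unknown" instead.
iface_is_up()
{
	[ "$(cat "/sys/class/net/$1/operstate" 2>/dev/null)" = "up" ]
}
```

With a hypothetical variable $addr holding the resolved address, the check could be written as: dev=$(route_dev_from_output "$(ip route get "$addr")") && iface_is_up "$dev"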
On 06/04/14 at 09:57am, Vivek Goyal wrote:
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:
[..]
Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?
When the network isn't ready, we can't really figure out which interface routes to DUMP_TARGET.
There can be situations where the local network is up, but there's something wrong with the network connection between the host and the local system, or the host network is still initializing.
In this case, should we fail right away without trying a few more times? So I'm not inclined to stop trying as soon as the local network is up and ssh fails.
I think it's not too bad to fail after 180 seconds. If it's a configuration issue (wrong key, no key, ...), the user could fix it after the first time the kdump service fails; from then on there would be no such issues and the retries would only be polling the network connection.
What do you think?
Thanks
WANG Chao
Thanks Vivek
On Fri, Jun 06, 2014 at 01:55:09PM +0800, WANG Chao wrote:
On 06/04/14 at 09:57am, Vivek Goyal wrote:
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:
[..]
Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?
When the network isn't ready, we can't really figure out which interface routes to DUMP_TARGET.
There can be situations where the local network is up, but there's something wrong with the network connection between the host and the local system, or the host network is still initializing.
I think we need to ask the networking folks and also check how apache waits for the interfaces.
In this case, should we fail right away without trying a few more times? So I'm not inclined to stop trying as soon as the local network is up and ssh fails.
I think it's not too bad to fail after 180 seconds. If it's a configuration issue (wrong key, no key, ...), the user could fix it after the first time the kdump service fails; from then on there would be no such issues and the retries would only be polling the network connection.
In the simplest form we could probably use something like "ping" and try to ping the target.
But this will have an issue if the target is configured not to respond to ping requests.
What do you think?
I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.
We need to think of something else.
Thanks
Vivek
On 06/06/14 at 02:08pm, Vivek Goyal wrote:
On Fri, Jun 06, 2014 at 01:55:09PM +0800, WANG Chao wrote:
On 06/04/14 at 09:57am, Vivek Goyal wrote:
On Wed, Jun 04, 2014 at 11:13:45AM +0800, WANG Chao wrote:
[..]
Hey, from DUMP_TARGET, can't we figure out which local network interface it is routed through and then check the status of that network interface?
When the network isn't ready, we can't really figure out which interface routes to DUMP_TARGET.
There can be situations where the local network is up, but there's something wrong with the network connection between the host and the local system, or the host network is still initializing.
I think we need to ask the networking folks and also check how apache waits for the interfaces.
In this case, should we fail right away without trying a few more times? So I'm not inclined to stop trying as soon as the local network is up and ssh fails.
I think it's not too bad to fail after 180 seconds. If it's a configuration issue (wrong key, no key, ...), the user could fix it after the first time the kdump service fails; from then on there would be no such issues and the retries would only be polling the network connection.
In the simplest form we could probably use something like "ping" and try to ping the target.
But this will have an issue if the target is configured not to respond to ping requests.
Yep, that could be the case...
What do you think?
I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.
The question is how we can distinguish the case of wrong keys from network disconnection. The ssh utility always returns 255 on failure.
What's more, network disconnection can have various causes:
- local network isn't ready yet (no ip address)
- host network isn't ready yet.
- network connection somehow fails:
  - router isn't working this time.
  - packet lost because connection isn't stable.
I agree that we should treat the issue of wrong keys differently from other issues. But the question is how we can separate them. Once that's figured out, we can handle this kind of failure differently ...
Thanks
WANG Chao
On Mon, Jun 09, 2014 at 07:01:32PM +0800, WANG Chao wrote:
[..]
I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.
The question is how we can distinguish the case of wrong keys from network disconnection. The ssh utility always returns 255 on failure.
What's more, network disconnection can have various causes:
- local network isn't ready yet (no ip address)
- host network isn't ready yet.
- network connection somehow fails:
  - router isn't working this time.
  - packet lost because connection isn't stable.
I don't think we can take care of issues like a router not working. Packet loss should be taken care of by the TCP/IP protocol.
I agree that we should treat the issue of wrong keys differently from other issues. But the question is how we can separate them. Once that's figured out, we can handle this kind of failure differently ...
I would say for the time being let us not do anything. Let us keep looking and once we have better ideas, we can write a patch. Retrying upon key verification failure is going to create more problems for us than it solves.
Thanks
Vivek
On 06/09/14 at 01:07pm, Vivek Goyal wrote:
On Mon, Jun 09, 2014 at 07:01:32PM +0800, WANG Chao wrote:
[..]
I am really not convinced that we should continue to retry if the keys are wrong. Expect a string of bugs on this.
The question is how we can distinguish the case of wrong keys from network disconnection. The ssh utility always returns 255 on failure.
What's more, network disconnection can have various causes:
- local network isn't ready yet (no ip address)
- host network isn't ready yet.
- network connection somehow fails:
  - router isn't working this time.
  - packet lost because connection isn't stable.
I don't think we can take care of issues like a router not working. Packet loss should be taken care of by the TCP/IP protocol.
You're right. Those are not the cases we should take care of.
I agree that we should treat the issue of wrong keys differently from other issues. But the question is how we can separate them. Once that's figured out, we can handle this kind of failure differently ...
I would say for the time being let us not do anything. Let us keep looking and once we have better ideas, we can write a patch. Retrying upon key verification failure is going to create more problems for us than it solves.
I agree. I'll hold off. If there's real-world demand in the future, we can always come back to this.