Hello VDSM developers,
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
* fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change) * refactor and clean getStats() and friends * on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Feedback welcome!
Bests,
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
- fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change)
- refactor and clean getStats() and friends
- on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
On Apr 8, 2014, at 18:57 , Dan Kenigsberg danken@redhat.com wrote:
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
VM moves to Unknown after some time
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
- fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change)
- refactor and clean getStats() and friends
- on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return
possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
the main reason for me is that currently if we fail to fetch some parts of the stats we still return what we could get. E.g. try: stats.update(self.guestAgent.getGuestInfo()) except Exception: return stats
currently in engine we can only tell it's partial by not seeing the missing bits… it makes sense since for lifecycle decisions we care for the VM status, but we don't necessarily care for other stuff
Thanks, michal
On Wed, Apr 09, 2014 at 01:11:22PM +0200, Michal Skrivanek wrote:
On Apr 8, 2014, at 18:57 , Dan Kenigsberg danken@redhat.com wrote:
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
VM moves to Unknown after some time
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
- fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change)
- refactor and clean getStats() and friends
- on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return
possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
the main reason for me is that currently if we fail to fetch some parts of the stats we still return what we could get. E.g. try: stats.update(self.guestAgent.getGuestInfo()) except Exception: return stats
currently in engine we can only tell it's partial by not seeing the missing bits… it makes sense since for lifecycle decisions we care for the VM status, but we don't necessarily care for other stuff
Engine knows that the data is partial, since it sees only parts of it. What is the benefit of Vdsm declaring it as partial?
On Apr 9, 2014, at 14:21 , Dan Kenigsberg danken@redhat.com wrote:
On Wed, Apr 09, 2014 at 01:11:22PM +0200, Michal Skrivanek wrote:
On Apr 8, 2014, at 18:57 , Dan Kenigsberg danken@redhat.com wrote:
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
VM moves to Unknown after some time
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
- fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change)
- refactor and clean getStats() and friends
- on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return
possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
the main reason for me is that currently if we fail to fetch some parts of the stats we still return what we could get. E.g. try: stats.update(self.guestAgent.getGuestInfo()) except Exception: return stats
currently in engine we can only tell it's partial by not seeing the missing bits… it makes sense since for lifecycle decisions we care for the VM status, but we don't necessarily care for other stuff
Engine knows that the data is partial, since it sees only parts of it. What is the benefit of Vdsm declaring it as partial?
there are a lot of variable fields so it's not that easy to say if it's partial or not. It does make sense to let "someone" know that there is an issue and you're getting partial data, yet you don't want to make any automatic actions...
On Wed, Apr 09, 2014 at 02:32:57PM +0200, Michal Skrivanek wrote:
On Apr 9, 2014, at 14:21 , Dan Kenigsberg danken@redhat.com wrote:
On Wed, Apr 09, 2014 at 01:11:22PM +0200, Michal Skrivanek wrote:
On Apr 8, 2014, at 18:57 , Dan Kenigsberg danken@redhat.com wrote:
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
However, this is just a bandaid and a proper solution is needed. This is the reason why the following versions of http://gerrit.ovirt.org/#/c/25803 changed direction toward a more comprehensive approach.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
VM moves to Unknown after some time
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
Moreover, I would like to take the chance and cleanup/refactor the statistics collection. I already started adding the test infrastructure: http://gerrit.ovirt.org/#/c/26536/
To summarize, what I suggest to do is:
- fix https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple ugly fix like the original
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll resubmit as new change)
- refactor and clean getStats() and friends
- on the cleaner base, properly fix the statistics collection by let getStats() always succeed and return
possibly partial stats, with a new field describing the content
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
the main reason for me is that currently if we fail to fetch some parts of the stats we still return what we could get. E.g. try: stats.update(self.guestAgent.getGuestInfo()) except Exception: return stats
currently in engine we can only tell it's partial by not seeing the missing bits… it makes sense since for lifecycle decisions we care for the VM status, but we don't necessarily care for other stuff
Engine knows that the data is partial, since it sees only parts of it. What is the benefit of Vdsm declaring it as partial?
there are a lot of variable fields so it's not that easy to say if it's partial or not. It does make sense to let "someone" know that there is an issue and you're getting partial data, yet you don't want to make any automatic actions...
But Vdsm cannot make this decision. Soon, Vdsm is to report the host's boot time. Now assume that Vdsm fails to do so. Is the stats "partial"? It's partial for engine-3.5, but it's complete for engine-3.4.
Vdsm should tell as much of the truth that it knows.
We could extend the "alerts" mechanism to report non-lethal errors in getVmStats (we currently have it only in for storage domain status), where Engine is told what's missing and why. I'm not sure if this is really needed, though.
----- Original Message -----
From: "Dan Kenigsberg" danken@redhat.com To: "Francesco Romani" fromani@redhat.com Cc: "vdsm-devel" vdsm-devel@lists.fedorahosted.org, devel@ovirt.org Sent: Tuesday, April 8, 2014 6:57:15 PM Subject: Re: cleaning statistics retrieval
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
Right, I forgot. Sorry about that.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282 if GetVmStats fails for whatever reason, engine thinks the VM is down.
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
Of course.
For GetAllVmStats, AFAIK, but please correct me if I am wrong, because I'm not really expert on engine side, engine simply does not expects anything different from a list of either a RunningVmStats or an ExitedVmStats.
So not sure (will verify just after this mail) if engine can cope with mixed content, meaning [ stats, errCode, stats, stats ... ]
For GetVmStats, like Michal said, the engine expects the call to succeed otherwise it puts the VM into the Unknown state.
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
I see your point (that's one of the reasons I'm not happy about this solution). Please see below for the detauls.
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
The root issue here is the getStats must always succeed, because the engine doesn't expect otherwise and thus we guarantee this to cope with old engines; but inside VDSM, getStats can actually fail in a lot of places (like in this case getBalloonInfo)
So, in VDSM we can end up with a partial response, and now the question is: what to do with this partial response? And if there are optional fields in the response, how can the engine distinguish between
* full response, but with an optional field missing
and
* partial response (because of an exception inside VDSM), without some field, incidentally including an optional one
?
The VDSM 'grading' was an hint from VDSM to help engine to distinguish between those cases.
Even if we agree this grading idea is bad, the core issue remains open: what to do if we end up with a partial response? For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/), the stats object to be returned will lack
* the (mandatory, expected) balloon stats * the (optional) migration progress - ok, bad example because this makes sense only during migrations, but other optional fields may be added later and the issue remains
Again, anyone feel free to correct me if I misunderstood something about engine (or VDSM <=> engine communication) and to suggest better alternatives :\
Thanks and bests,
On Wed, Apr 09, 2014 at 08:40:24AM -0400, Francesco Romani wrote:
----- Original Message -----
From: "Dan Kenigsberg" danken@redhat.com To: "Francesco Romani" fromani@redhat.com Cc: "vdsm-devel" vdsm-devel@lists.fedorahosted.org, devel@ovirt.org Sent: Tuesday, April 8, 2014 6:57:15 PM Subject: Re: cleaning statistics retrieval
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
Right, I forgot. Sorry about that.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282 if GetVmStats fails for whatever reason, engine thinks the VM is down.
Could you rename the Vdsm bug to exprese the in-vdsm problem, and make clear that it confuses older Engines to think the Vm is Down?
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
Of course.
For GetAllVmStats, AFAIK, but please correct me if I am wrong, because I'm not really expert on engine side, engine simply does not expects anything different from a list of either a RunningVmStats or an ExitedVmStats.
So not sure (will verify just after this mail) if engine can cope with mixed content, meaning [ stats, errCode, stats, stats ... ]
For GetVmStats, like Michal said, the engine expects the call to succeed otherwise it puts the VM into the Unknown state.
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
I see your point (that's one of the reasons I'm not happy about this solution). Please see below for the detauls.
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
The root issue here is the getStats must always succeed, because the engine doesn't expect otherwise and thus we guarantee this to cope with old engines; but inside VDSM, getStats can actually fail in a lot of places (like in this case getBalloonInfo)
So, in VDSM we can end up with a partial response, and now the question is: what to do with this partial response? And if there are optional fields in the response, how can the engine distinguish between
- full response, but with an optional field missing
and
- partial response (because of an exception inside VDSM), without some field, incidentally including an optional one
?
The VDSM 'grading' was an hint from VDSM to help engine to distinguish between those cases.
Even if we agree this grading idea is bad, the core issue remains open: what to do if we end up with a partial response? For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/), the stats object to be returned will lack
- the (mandatory, expected) balloon stats
- the (optional) migration progress - ok, bad example because this makes sense only during migrations, but other optional fields may be added later and the issue remains
Again, anyone feel free to correct me if I misunderstood something about engine (or VDSM <=> engine communication) and to suggest better alternatives :\
Currently, we have way too many try-except-Exception clauses in our code. They swallow everything: from expected libvirt errors to unexpected syntax errors. We should eliminate them, not add more.
Mandatory stuff must be reported, so if we fail extracting them we'd better explode and raise an error. Optional stuff are optional, so we could drop them from the output, in order to report the mandatory ones.
http://gerrit.ovirt.org/#/c/26539/2/vdsm/virt/vm.py currently suggest that Vdsm lie when it fails to extract the current ballon size, and say that it's 0. I'd prefer to drop balloon_cur from the return value of _getBalloonInfo or drop balloonInfo from the reported stats.
My only question is the granularity of the definition: is balloonInfo atomic, or can it be reported without balloon_cur? Should it? I'd prefer to have the "atoms" as big as possible, to limit the number of combinations - if self._dom is None, don't report balloonInfo at all, and so if the libvirt connection timed out.
On Apr 10, 2014, at 12:25 , Dan Kenigsberg danken@redhat.com wrote:
On Wed, Apr 09, 2014 at 08:40:24AM -0400, Francesco Romani wrote:
----- Original Message -----
From: "Dan Kenigsberg" danken@redhat.com To: "Francesco Romani" fromani@redhat.com Cc: "vdsm-devel" vdsm-devel@lists.fedorahosted.org, devel@ovirt.org Sent: Tuesday, April 8, 2014 6:57:15 PM Subject: Re: cleaning statistics retrieval
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread in particular involves Engine as well.
Right, I forgot. Sorry about that.
I'd like to discuss and explain the plans for cleaning up Vm.getStats() in vdsm/virt/vm.py, and how it affects a bug we have: https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably introduced by commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49 The race can actually be triggered only if the VM is shutting down, so it is not that bad.
Fixing the reported issue in the specific case can be done with a trivial if, and that it what I did in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282 if GetVmStats fails for whatever reason, engine thinks the VM is down.
Could you rename the Vdsm bug to exprese the in-vdsm problem, and make clear that it confuses older Engines to think the Vm is Down?
And this is the core of the issue. My initial idea is to either return success and a complete, well formed statistics set, or return an error. However current engine seems to not cope properly with this, and we cannot break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one of the Vms, what happens?
Of course.
For GetAllVmStats, AFAIK, but please correct me if I am wrong, because I'm not really expert on engine side, engine simply does not expects anything different from a list of either a RunningVmStats or an ExitedVmStats.
So not sure (will verify just after this mail) if engine can cope with mixed content, meaning [ stats, errCode, stats, stats ... ]
For GetVmStats, like Michal said, the engine expects the call to succeed otherwise it puts the VM into the Unknown state.
Looks like the only way to go is to always return success and to add a field to describe the content of the statistics (partial, complete...). THis is admittedly a far cry from the ideal solution, but it is dictated by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound right to send a partial dictionary and to "appologize" about its paritiality elsewhere. The dictionary may be "partial" for engine-4.0 yet "complete" for engine-3.5. It's not for Vdsm to grade its own output.
I see your point (that's one of the reasons I'm not happy about this solution). Please see below for the detauls.
please note that I'm not really happy about this solution, but, given the constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do not see them.
The root issue here is the getStats must always succeed, because the engine doesn't expect otherwise and thus we guarantee this to cope with old engines; but inside VDSM, getStats can actually fail in a lot of places (like in this case getBalloonInfo)
So, in VDSM we can end up with a partial response, and now the question is: what to do with this partial response? And if there are optional fields in the response, how can the engine distinguish between
- full response, but with an optional field missing
and
- partial response (because of an exception inside VDSM),
without some field, incidentally including an optional one
?
The VDSM 'grading' was an hint from VDSM to help engine to distinguish between those cases.
Even if we agree this grading idea is bad, the core issue remains open: what to do if we end up with a partial response? For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/), the stats object to be returned will lack
- the (mandatory, expected) balloon stats
- the (optional) migration progress - ok, bad example because this makes sense only during migrations,
but other optional fields may be added later and the issue remains
Again, anyone feel free to correct me if I misunderstood something about engine (or VDSM <=> engine communication) and to suggest better alternatives :\
Currently, we have way too many try-except-Exception clauses in our code. They swallow everything: from expected libvirt errors to unexpected syntax errors. We should eliminate them, not add more.
Mandatory stuff must be reported, so if we fail extracting them we'd better explode and raise an error. Optional stuff are optional, so we could drop them from the output, in order to report the mandatory ones.
http://gerrit.ovirt.org/#/c/26539/2/vdsm/virt/vm.py currently suggest that Vdsm lie when it fails to extract the current ballon size, and say that it's 0. I'd prefer to drop balloon_cur from the return value of _getBalloonInfo or drop balloonInfo from the reported stats.
it does make a lot of sense….but my only worry is the backward compatibility. E.g. for the balloon the current engine code,AFAICT, thinks that the device is not present when we exclude balloonInfo, whereas reporting 0 in balloon_cur still means the balloon is still there I'm not sure if we can ever find out all possible regressions all the way to 3.0
Then we are "just" speaking about the 4.0 solution.
My only question is the granularity of the definition: is balloonInfo atomic, or can it be reported without balloon_cur? Should it? I'd prefer to have the "atoms" as big as possible, to limit the number of combinations - if self._dom is None, don't report balloonInfo at all, and so if the libvirt connection timed out. _______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
----- Original Message -----
From: "Dan Kenigsberg" danken@redhat.com To: "Francesco Romani" fromani@redhat.com Cc: devel@ovirt.org, "vdsm-devel" vdsm-devel@lists.fedorahosted.org Sent: Thursday, April 10, 2014 12:25:33 PM Subject: Re: [VDSM] cleaning statistics retrieval
[...]
Could you explain how an AttributeError there moved the VM to Down?
This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282 if GetVmStats fails for whatever reason, engine thinks the VM is down.
Could you rename the Vdsm bug to exprese the in-vdsm problem, and make clear that it confuses older Engines to think the Vm is Down?
Done
[...]
The VDSM 'grading' was an hint from VDSM to help engine to distinguish between those cases.
Even if we agree this grading idea is bad, the core issue remains open: what to do if we end up with a partial response? For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/), the stats object to be returned will lack
- the (mandatory, expected) balloon stats
- the (optional) migration progress - ok, bad example because this makes
sense only during migrations, but other optional fields may be added later and the issue remains
Again, anyone feel free to correct me if I misunderstood something about engine (or VDSM <=> engine communication) and to suggest better alternatives :\
Currently, we have way too many try-except-Exception clauses in our code. They swallow everything: from expected libvirt errors to unexpected syntax errors. We should eliminate them, not add more.
Mandatory stuff must be reported, so if we fail extracting them we'd better explode and raise an error. Optional stuff are optional, so we could drop them from the output, in order to report the mandatory ones.
http://gerrit.ovirt.org/#/c/26539/2/vdsm/virt/vm.py currently suggest that Vdsm lie when it fails to extract the current ballon size, and say that it's 0. I'd prefer to drop balloon_cur from the return value of _getBalloonInfo or drop balloonInfo from the reported stats.
My only question is the granularity of the definition: is balloonInfo atomic, or can it be reported without balloon_cur? Should it? I'd prefer to have the "atoms" as big as possible, to limit the number of combinations - if self._dom is None, don't report balloonInfo at all, and so if the libvirt connection timed out.
OK, I'll amend http://gerrit.ovirt.org/#/c/26539/ accordingly (and of course taking in account Michal's remarks).
I posted previously a tentative getStats cleanup in the form of this patch series http://gerrit.ovirt.org/#/c/26547/ (13 patches but quite fine-grained), in light of what we discussed on this thread I consider them obsolete and I'm going to abandon them.
Bests,
vdsm-devel@lists.stg.fedorahosted.org