[I'm not subscribed to this list, please keep me in CC.]
Heya,
A little while ago, we (Matthew Miller, myself, and Attila Fazekas, an upstream OpenStack developer) had an IRC discussion (on #openstack-qa, Freenode) with the OpenStack upstream CI infrastructure folks about their concerns over continuing to have Fedora as a default CI voting guest (Nova instance). They (mostly Sean Dague, a major upstream OpenStack contributor, who voiced these) outlined a few issues:
1. It's not possible to download from the Fedora infrastructure reliably: a 10% failure rate from their cloud providers (HP and RAX).
   - About this point, when mattdm inquired, "is the failure in hitting the fedora mirrors or fedora core infrastructure?", their response was: "I don't fully know, I think going through the url we are using we get bounced to mirrors".
2. There are possibly issues with the normal upstream Fedora image that could be fixed with a custom respin.
   - NOTE: I'm doubtful of this idea, as the existing Fedora cloud images themselves are not really extensively tested. I'd rather focus on the _official_ cloud images and on having a solid set of tests, so that they can be consumed by cloud projects (OpenStack, etc.).
   - Having a custom respin means that we're off the main path for testing of the image, which again needs _some_ level of assurance that it can be used in a higher-level cloud project's CI infra.
3. Another important point the OpenStack infra folks emphasized: these images will get 4000 test runs a week.
Any suggestions to allay these are welcome.
-- /kashyap
On Thu, 19 Jun 2014 00:24:55 +0530 Kashyap Chamarthy kchamart@redhat.com wrote:
[I'm not subscribed to this list, please keep me in CC.]
Heya,
A little while ago, we (Matthew Miller, myself, and Attila Fazekas, an upstream OpenStack developer) had an IRC discussion (on #openstack-qa, Freenode) with the OpenStack upstream CI infrastructure folks about their concerns over continuing to have Fedora as a default CI voting guest (Nova instance). They (mostly Sean Dague, a major upstream OpenStack contributor, who voiced these) outlined a few issues:
I'm not familiar with the terminology; what does a 'voting guest' mean?
- It's not possible to download from the fedora infrastructure reliably - 10% failure rate from their cloud providers (HP and RAX).
- About this point, when mattdm inquired - "is the failure in hitting the fedora mirrors or fedora core infrastructure?", their response - "I don't fully know, I think going through
the url we are using we get bounced to mirrors".
Yeah, more data would be very nice here... what url(s) they are using, what error codes if any they get back?
Are these the released cloud images? f19/20? Or nightlies or ?
How often do they download? Once an image is loaded, I am not sure why they would re-download it unless it's changed? Or unless they are grabbing nightly rawhide images?
- There are possibly issues with the normal upstream fedora image that could be fixed with custom respin.
- NOTE: I'm doubtful of this idea, as the existing Fedora cloud images themselves are not really extensively tested. I'd rather focus on the _official_ cloud images and on having a solid set of tests, so that they can be consumed by cloud projects (OpenStack, etc.).
- Having a custom respin means that we're off the main path for testing of the image, which again needs _some_ level of assurance that it can be used in a higher-level cloud project's CI infra.
Yeah, I would think we would like to avoid that... and try and merge in the changes they need for images instead of them going and making their own that only they use.
- Another important point OpenStack infra folks emphasized is -
these images will get 4000 test runs a week on them
Cool.
Any suggestions to allay these are welcome.
Happy to try and solve any bottlenecks they are having...
kevin
On Thu, Jun 19, 2014 at 09:20:14AM -0600, Kevin Fenzi wrote:
On Thu, 19 Jun 2014 00:24:55 +0530 Kashyap Chamarthy kchamart@redhat.com wrote:
[I'm not subscribed to this list, please keep me in CC.]
Heya,
A little while ago, we (Matthew Miller, myself, and Attila Fazekas, an upstream OpenStack developer) had an IRC discussion (on #openstack-qa, Freenode) with the OpenStack upstream CI infrastructure folks about their concerns over continuing to have Fedora as a default CI voting guest (Nova instance). They (mostly Sean Dague, a major upstream OpenStack contributor, who voiced these) outlined a few issues:
I'm not familiar with the terminology; what does a 'voting guest' mean?
Sorry for being unclear. It means that any proposed OpenStack change/patch has to be run on a Fedora virtual machine too; only once it passes the tests on Fedora will the patch be merged into upstream git. I cc'd Attila; he can correct me if I said something wrong.
- It's not possible to download from the fedora infrastructure reliably - 10% failure rate from their cloud providers (HP and RAX).
- About this point, when mattdm inquired - "is the failure in hitting the fedora mirrors or fedora core infrastructure?", their response - "I don't fully know, I think going through
the url we are using we get bounced to mirrors".
Yeah, more data would be very nice here... what url(s) they are using, what error codes if any they get back?
Looking at the script[1] that creates the CI VM, it uses this URL -- https://dl.fedoraproject.org/pub/fedora/linux/releases/20/Images/x86_64/Fedo...
[1] https://github.com/openstack-dev/devstack/blob/master/stackrc#L353
Are these the released cloud images? f19/20? Or nightlies or ?
Released, official images.
How often do they download? Once an image is loaded, I am not sure why they would re-download it unless it's changed?
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
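One way to picture the "download once, rebuild every 24 hours" behaviour described here is a freshness check on the cached image file. This is only a sketch: how the OpenStack CI actually manages its caches is not shown in this thread, and the function name and the use of the file's modification time are assumptions made for illustration.

```python
import os
import time

DAY_SECONDS = 24 * 60 * 60


def cache_is_stale(path, max_age=DAY_SECONDS, now=None):
    """Return True if the cached file is missing or older than max_age seconds.

    Hypothetical helper: it uses the file's modification time as the
    moment the image was last downloaded.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # Nothing cached yet, so a download is needed.
        return True
    return (now - mtime) > max_age
```

A caller would download the image again only when `cache_is_stale(...)` returns True, which keeps the number of requests against the download server to roughly one per cache per day.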
Or unless they are grabbing nightly rawhide images?
They'd prefer not to do this, as only distribution-tested images will be used in the OpenStack CI environment.
- There are possibly issues with the normal upstream fedora image that could be fixed with custom respin.
- NOTE: I'm doubtful of this idea, as the existing Fedora cloud images themselves are not really extensively tested. I'd rather focus on the _official_ cloud images and on having a solid set of tests, so that they can be consumed by cloud projects (OpenStack, etc.).
- Having a custom respin means that we're off the main path for testing of the image, which again needs _some_ level of assurance that it can be used in a higher-level cloud project's CI infra.
Yeah, I would think we would like to avoid that... and try and merge in the changes they need for images instead of them going and making their own that only they use.
Oh, that's my poor wording; they didn't mean to say _they'd_ create these custom images. OpenStack infra is clear: they'd only use reasonably well-tested images from distributions.
- Another important point OpenStack infra folks emphasized is -
these images will get 4000 test runs a week on them
Cool.
Any suggestions to allay these are welcome.
Happy to try and solve any bottlenecks they are having...
Yeah, folks are testing more than ever with Fedora lately.
OpenStack infra/qa folks have an upcoming meetup to discuss several topics, and Fedora is also on their agenda. I will let you know if they provide more specific, technical feedback.
Thanks.
On Fri, 20 Jun 2014 09:25:17 +0530 Kashyap Chamarthy kchamart@redhat.com wrote:
I'm not familiar with the terminology; what does a 'voting guest' mean?
Sorry for being unclear. It means that any proposed OpenStack change/patch has to be run on a Fedora virtual machine too; only once it passes the tests on Fedora will the patch be merged into upstream git. I cc'd Attila; he can correct me if I said something wrong.
Ah, interesting. Cool.
Looking at the script[1] that creates the CI VM, it uses this URL -- https://dl.fedoraproject.org/pub/fedora/linux/releases/20/Images/x86_64/Fedo...
[1] https://github.com/openstack-dev/devstack/blob/master/stackrc#L353
Strange. That's going against our master mirrors. They shouldn't have any downtime or problems.
I would be very interested in any error output from when these downloads fail.
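To capture the kind of error output being asked for here, a download wrapper can record what went wrong instead of just failing. The sketch below is an illustration only, not what devstack or the OpenStack CI does: the function names are made up, and the `opener` parameter exists solely so the logic can be exercised without touching the network.

```python
import socket
import urllib.error
import urllib.request


def describe_failure(exc):
    """Turn a download exception into a short, loggable description."""
    # HTTPError must be checked before URLError, because it is a subclass.
    if isinstance(exc, urllib.error.HTTPError):
        return "http %d %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "url error: %s" % (exc.reason,)
    if isinstance(exc, socket.timeout):
        return "timeout"
    return "other: %r" % (exc,)


def try_download(url, opener=urllib.request.urlopen, timeout=60):
    """Attempt one download; return (ok, detail) without raising."""
    try:
        with opener(url, timeout=timeout) as resp:
            # geturl() reports the final URL after any redirects,
            # which shows whether the request was bounced to a mirror.
            final_url = resp.geturl()
            size = 0
            while True:
                chunk = resp.read(1 << 20)
                if not chunk:
                    break
                size += len(chunk)
        return True, "ok %d bytes from %s" % (size, final_url)
    except Exception as exc:
        return False, describe_failure(exc)
```

Logging the `detail` string for every attempt would show both the HTTP status of a failure and the final URL of a success, which speaks to the earlier question of whether requests end up on the master mirrors or elsewhere.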
Are these the released cloud images? f19/20? Or nightlies or ?
Released, official images.
ok.
How often do they download? Once an image is loaded, I am not sure why they would re-download it unless it's changed?
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
ok. I'd love to hear what error(s) they see so we can clear them up.
Or unless they are grabbing nightly rawhide images?
They'd prefer not to do this, as only distribution-tested images will be used in the OpenStack CI environment.
Fair enough.
kevin
On Fri, Jun 20, 2014 at 09:57:39AM -0600, Kevin Fenzi wrote:
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
ok. I'd love to hear what error(s) they see so we can clear them up.
Sean Dague also said that there were frequent *boot* errors (apparently unrelated to getting the images or infrastructure.) I'd obviously also like to get some hard information on that as well.
----- Original Message -----
From: "Matthew Miller" mattdm@fedoraproject.org
To: "Kevin Fenzi" kevin@scrye.com
Cc: "Kashyap Chamarthy" kchamart@redhat.com, infrastructure@lists.fedoraproject.org, afazekas@redhat.com
Sent: Friday, June 20, 2014 6:05:28 PM
Subject: Re: Reliability of Fedora infrastructure to download cloud images
On Fri, Jun 20, 2014 at 09:57:39AM -0600, Kevin Fenzi wrote:
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
ok. I'd love to hear what error(s) they see so we can clear them up.
Sean Dague also said that there were frequent *boot* errors (apparently unrelated to getting the images or infrastructure.) I'd obviously also like to get some hard information on that as well.
I still do not have enough information for a bug report; I do not even know whether it is the L1 (qemu) or the L2 (kernel) guest that fails.
An L2 SMP guest (without nested guest support; with nesting I haven't seen the issue so far) can have I/O-related issues when the L2 guest has more than 1 vcpu and the L1 guest is on a low-latency drive with the 'unsafe' cache mode enabled.
The OpenStack gate now uses 1 VCPU for the L2 F20 guest, and no issue has been seen since then.
I am going to try to reproduce the issue on a full Fedora system (normally I have nested virt enabled on an F20 L0).
The issue can be seen on F20 L1+L2 with an el6 L0.
-- Matthew Miller mattdm@fedoraproject.org Fedora Project Leader
On Fri, Jun 20, 2014 at 09:57:39AM -0600, Kevin Fenzi wrote:
On Fri, 20 Jun 2014 09:25:17 +0530 Kashyap Chamarthy kchamart@redhat.com wrote:
I'm not familiar with the terminology; what does a 'voting guest' mean?
Sorry for being unclear. It means that any proposed OpenStack change/patch has to be run on a Fedora virtual machine too; only once it passes the tests on Fedora will the patch be merged into upstream git. I cc'd Attila; he can correct me if I said something wrong.
Ah, interesting. Cool.
That said, due to current hardware resourcing and VM allocation constraints, Fedora jobs are "experimental" for now (meaning someone has to add an explicit "check experimental" comment in Gerrit to test on Fedora). Even so, we do have frequent testers on Fedora.
The good news is that this is just temporary. Ian Wienand is steadfastly working (thanks!) with OpenStack infrastructure to fix the various distribution allocation issues in the infrastructure[1]. For the technically curious, Ian explains the node allocation (in OpenStack CI infra) in very clear detail here[2].
[1] https://review.openstack.org/#/c/101110/
[2] http://lists.openstack.org/pipermail/openstack-infra/2014-June/001358.html
Looking at the script[1] that creates the CI VM, it uses this URL -- https://dl.fedoraproject.org/pub/fedora/linux/releases/20/Images/x86_64/Fedo...
[1] https://github.com/openstack-dev/devstack/blob/master/stackrc#L353
Strange. That's going against our master mirrors. They shouldn't have any downtime or problems.
I would be very interested in any error output from when these downloads fail.
Ian reported that this was an issue with the infrastructure hosting providers for upstream OpenStack, so it is not an issue on Fedora's infra side.
Are these the released cloud images? f19/20? Or nightlies or ?
Released, official images.
ok.
How often do they download? Once an image is loaded, I am not sure why they would re-download it unless it's changed?
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
ok. I'd love to hear what error(s) they see so we can clear them up.
Yes, I'm lurking on the IRC/mailing list, so if anyone brings up issues, I'll notify here.
Or unless they are grabbing nightly rawhide images?
They'd prefer not to do this, as only distribution-tested images will be used in the OpenStack CI environment.
Fair enough.
Thanks Kevin (and Matthew) for your quick responses, as usual.
----- Original Message -----
From: "Kashyap Chamarthy" kchamart@redhat.com
To: "Kevin Fenzi" kevin@scrye.com
Cc: infrastructure@lists.fedoraproject.org, mattdm@fedoraproject.org, afazekas@redhat.com
Sent: Friday, June 20, 2014 5:55:17 AM
Subject: Re: Reliability of Fedora infrastructure to download cloud images
On Thu, Jun 19, 2014 at 09:20:14AM -0600, Kevin Fenzi wrote:
On Thu, 19 Jun 2014 00:24:55 +0530 Kashyap Chamarthy kchamart@redhat.com wrote:
[I'm not subscribed to this list, please keep me in CC.]
Heya,
A little while ago, we (Matthew Miller, myself, and Attila Fazekas, an upstream OpenStack developer) had an IRC discussion (on #openstack-qa, Freenode) with the OpenStack upstream CI infrastructure folks about their concerns over continuing to have Fedora as a default CI voting guest (Nova instance). They (mostly Sean Dague, a major upstream OpenStack contributor, who voiced these) outlined a few issues:
I'm not familiar with the terminology; what does a 'voting guest' mean?
Sorry for being unclear. It means that any proposed OpenStack change/patch has to be run on a Fedora virtual machine too; only once it passes the tests on Fedora will the patch be merged into upstream git. I cc'd Attila; he can correct me if I said something wrong.
If the job is voting on the gate pipeline it can prevent incompatible changes.
- It's not possible to download from the fedora infrastructure reliably - 10% failure rate from their cloud providers (HP and RAX).
- About this point, when mattdm inquired - "is the failure in hitting the fedora mirrors or fedora core infrastructure?", their response - "I don't fully know, I think going through
the url we are using we get bounced to mirrors".
Yeah, more data would be very nice here... what url(s) they are using, what error codes if any they get back?
I saw the image download failure at least once, but I cannot find a pattern in the failures :(. IMHO it was less than a 10% failure rate, but OpenStack infra/QA notices issues above a 0.1% failure rate.
If I or anyone else sees the failure pattern again, they can add a query to http://status.openstack.org/elastic-recheck/. Then we would know exactly how often the issue happens.
Anyone who signs the OpenStack contributor agreement can propose queries to the repo: https://github.com/openstack-infra/elastic-recheck/tree/master/queries
Here are the image download urls: https://github.com/openstack-dev/devstack/blob/master/stackrc#L357
Looking at the script[1] that creates the CI VM, it uses this URL -- https://dl.fedoraproject.org/pub/fedora/linux/releases/20/Images/x86_64/Fedo...
[1] https://github.com/openstack-dev/devstack/blob/master/stackrc#L353
Are these the released cloud images? f19/20? Or nightlies or ?
Released, official images.
How often do they download? Once an image is loaded, I am not sure why they would re-download it unless it's changed?
I just confirmed: they (CI infra) download and cache it, but they rebuild the caches once every 24 hours. They say it's the humans who download it manually (without any caching environment) who face the bottlenecks.
AFAIK every worker node downloads the L2 images once in its lifetime; I do not know the average lifetime of these VMs. An L2 image version switch can lead to ~500 image downloads in 1 hour.
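As a rough illustration of why even a small failure rate is visible at this scale, the sketch below takes the ~500 downloads in an hour mentioned here together with the 0.1% and 10% rates discussed earlier in the thread. It assumes every download fails independently with the same probability, which is a simplification (real failures may well be correlated), and the function names are made up for this example.

```python
def expected_failures(downloads, failure_rate):
    """Average number of failed downloads, assuming independent failures."""
    return downloads * failure_rate


def chance_of_any_failure(downloads, failure_rate):
    """Probability that at least one of the downloads fails."""
    return 1.0 - (1.0 - failure_rate) ** downloads


if __name__ == "__main__":
    for rate in (0.001, 0.10):
        print(
            "rate=%.1f%%: about %.1f failures expected, "
            "%.0f%% chance of at least one"
            % (
                rate * 100,
                expected_failures(500, rate),
                chance_of_any_failure(500, rate) * 100,
            )
        )
```

Under these assumptions, 500 downloads at a 0.1% rate average half a failure, with roughly a 39% chance of at least one, while a 10% rate would mean about 50 failures in the same burst.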
Or unless they are grabbing nightly rawhide images?
They'd prefer not to do this, as only distribution-tested images will be used in the OpenStack CI environment.
- There are possibly issues with the normal upstream fedora image that could be fixed with custom respin.
- NOTE: I'm doubtful of this idea, as the existing Fedora cloud images themselves are not really extensively tested. I'd rather focus on the _official_ cloud images and on having a solid set of tests, so that they can be consumed by cloud projects (OpenStack, etc.).
- Having a custom respin means that we're off the main path for testing of the image, which again needs _some_ level of assurance that it can be used in a higher-level cloud project's CI infra.
Yeah, I would think we would like to avoid that... and try and merge in the changes they need for images instead of them going and making their own that only they use.
Oh, that's my poor wording; they didn't mean to say _they'd_ create these custom images. OpenStack infra is clear: they'd only use reasonably well-tested images from distributions.
- Another important point OpenStack infra folks emphasized is -
these images will get 4000 test runs a week on them
Cool.
Any suggestions to allay these are welcome.
Happy to try and solve any bottlenecks they are having...
Yeah, folks are testing more than ever with Fedora lately.
OpenStack infra/qa folks have an upcoming meetup to discuss several topics, and Fedora is also on their agenda. I will let you know if they provide more specific, technical feedback.
Thanks.
-- /kashyap