This came up in a different venue and pingou and I have continued to talk about it. Seemed that this was the right place to bring the discussion though.
Some observations:
* Pkgdb2 and a call for testing in staging were announced well in advance of the deployment to production (good) but not everyone understood that we were going to be breaking API (bad).
* There were people inside of fedora infrastructure and outside of infrastructure who were surprised by the API break. There were also some community members and infrastructure members who heeded the call for testing and both gave feedback and ported before the deployment.
* There was a FAS2 update that pkgdb2 depended upon. That was also pending in stg for a long time and also had some minor API changes (IIRC, all unintentional. I hotfixed one of them that was simply a bug last week). These also caused issues for some scripts.
* Unexpected problems: we had things that we didn't know used the pkgdb API, things that weren't tested in stg because stg couldn't replicate that part of production, and things that were ported but mistakes caused the ported scripts to not be deployed or to point at stg instead of production. I saw that we had the right people on IRC throughout the day working on analyzing and patching all of the broken things. However, this was somewhat by accident and some of those people were surprised that they spent their day doing this.
Some ideas for doing major deployments in the future:
1: We have to make people aware when a new deployment means API breaks.
  * Be clear that the new deployment means API breaks in every call for testing. Send announcements to infrastructure list and depending on the service to devel list.
  * Have a separate announcement besides the standard outage notification that says that an API breaking update is planned for $date
  * When we set a date for the new deployment, discuss it at least once in a weekly infrastructure meeting.
  * See also the solution in #3 below
2: It would be really nice for people to do more testing in stg.
  * Increase rube coverage. rube does end-to-end testing so it's better at catching cross-app issues where APIs change than unittests, which try to be small and self-contained
    - A flock session where everyone/dev in infra gets to write one rube test so we get to know the framework
  * Run rube daily
    - Could we run rube in an Xvfb on an infrastructure host?
  * Continue to work towards a complete replica of production in the stg environment.
3: "Mean time to repair is more important than mean time between failure." It seems like anytime there's a major update there's unexpected things that break. Let's anticipate the unexpected happening.
  * Explicitly plan for everyone to spend their day firefighting when we make a major new deployment. If you've already found all the places your code is affected and pre-ported it and the deployment goes smoothly then hey, you've got 6 extra working hours to shift back to doing other things. If it's not smooth, then we've planned to have the attention of the right people for the unexpected difficulties that arise.
  * As part of this, we need to identify people outside of infrastructure that should also be ready for breakage. Reach out to rel-eng, docs, qa, cvsadmins, etc if there's a chance that they will be affected.
4: Related to the FAS release: Buggy code happens. How can we make it happen less?
  * More unittests would be good; however, we know from experience with bodhi that unittests don't catch a lot of things that are changes in behaviour rather than true "bugs". Unexpected API changes that cause people porting pain can be as simple as returning None instead of an empty list, which causes a no-op iteration in running code to fail while the unittests survive because they're only checking that "no results were returned".
  * Pingou has championed making API calls and WebUI calls into separate URL endpoints. I think that coding style makes it easier to control bugs related to updating the webui while trying to preserve the API, so we probably want to move to that model as we move onto the next major version of our apps.
  * Not returning json-ified versions of internal data structures (like database tables) but instead parsing the results and returning a specific structure would also help divorce internal changes from external API.
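The None-versus-empty-list case and the "specific structure" idea can be made concrete. This is only a minimal sketch: `serialize_package()`, `count_acls()` and the field names are hypothetical and are not taken from pkgdb or FAS. The response is built from an explicit structure instead of dumping the internal row, and `acls` is always a list so client loops keep working.

```python
def serialize_package(row):
    """Build the public API response from an internal row (a dict here).

    Hypothetical helper: only the fields listed below are exposed, so
    renaming or adding internal columns does not change the API.
    """
    return {
        "name": row["name"],
        "summary": row.get("summary") or "",
        # Always a list, never None, so clients can iterate safely.
        "acls": list(row.get("acls") or []),
    }


def count_acls(response):
    """Typical client code: a loop that is a no-op on an empty list."""
    total = 0
    for _acl in response["acls"]:
        total += 1
    return total


# Internal row with an extra column and a NULL acls value.
row = {"name": "guake", "summary": None, "acls": None, "internal_id": 42}
response = serialize_package(row)
print(count_acls(response))

# The failure mode described above: a raw None makes the same loop raise,
# while a check for "no results were returned" passes for None and [].
raw = {"acls": None}
assert not raw["acls"]
try:
    count_acls(raw)
except TypeError as exc:
    print("client breaks:", exc)
```

A unittest that asserts on the exact type (`== []`) rather than on truthiness would catch this kind of change.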
What should we apply this to?
  * Probably can skip if:
    - Things that we don't think have API breaks
    - Things that are minor releases (hopefully these would correlate with not having API breaks :-)
    - Leaf services that are not essential to releasing Fedora.
      + ask, nuancier, elections, easyfix, badges, paste
      + There's a lot of borderline cases too -- is fedocal essential enough to warrant being under this policy? Since the wiki is used via its API should that fall under this as well?
Comments, thoughts, other ideas?
Do we need to "ratify" something like this at a meeting?
What's the next app deploy where we'll want to enact this? Maybe bodhi2 ;-)?
-Toshio
On Wed, 4 Jun 2014 11:44:54 -0700 Toshio Kuratomi a.badger@gmail.com wrote:
> This came up in a different venue and pingou and I have continued to talk about it. Seemed that this was the right place to bring the discussion though.
> Some observations:
> - Pkgdb2 and a call for testing in staging were announced well in advance of the deployment to production (good) but not everyone understood that we were going to be breaking API (bad).
> - There were people inside of fedora infrastructure and outside of infrastructure who were surprised by the API break. There were also some community members and infrastructure members who heeded the call for testing and both gave feedback and ported before the deployment.
> - There was a FAS2 update that pkgdb2 depended upon. That was also pending in stg for a long time and also had some minor API changes (IIRC, all unintentional. I hotfixed one of them that was simply a bug last week). These also caused issues for some scripts.
> - Unexpected problems: we had things that we didn't know used the pkgdb API, things that weren't tested in stg because stg couldn't replicate that part of production, and things that were ported but mistakes caused the ported scripts to not be deployed or to point at stg instead of production. I saw that we had the right people on IRC throughout the day working on analyzing and patching all of the broken things. However, this was somewhat by accident and some of those people were surprised that they spent their day doing this.
> Some ideas for doing major deployments in the future:
> 1: We have to make people aware when a new deployment means API breaks.
> - Be clear that the new deployment means API breaks in every call for testing. Send announcements to infrastructure list and depending on the service to devel list.
> - Have a separate announcement besides the standard outage notification that says that an API breaking update is planned for $date
> - When we set a date for the new deployment, discuss it at least once in a weekly infrastructure meeting.
> - See also the solution in #3 below
I have a suggestion: make everything provide and validate API info.
This is something koji does. Though to date we have not broken the API, we expect we will when we start on koji-2.0. Each app would provide a getAPIVersion() function, and the consumers would then validate that they know how to talk to that API version.
For bodhi, for instance, apps could check for the API version and, if the check fails, assume version 1. Clients would then be able to support both version 1 and 2, and when we deploy bodhi2 live, clients like fedpkg update will transparently switch over.
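A minimal sketch of that client-side check, assuming the version is a plain integer and that an old server simply lacks the call. `OldServer`, `NewServer`, `SUPPORTED` and `pick_protocol()` are hypothetical stand-ins and are not koji or bodhi code; a real client would catch whatever error its RPC library raises for an unknown method.

```python
def get_api_version(service):
    """Return the server's API version as an int, assuming 1 on failure.

    `service` is any object that may offer getAPIVersion(); servers that
    predate the call are treated as speaking version 1.
    """
    try:
        return int(service.getAPIVersion())
    except (AttributeError, NotImplementedError):
        return 1


class OldServer:
    """Stand-in for a server that predates getAPIVersion()."""


class NewServer:
    """Stand-in for a server that reports its API version."""

    def getAPIVersion(self):
        return 2


SUPPORTED = {1, 2}  # API versions this hypothetical client can speak


def pick_protocol(service):
    """Validate the server's version before talking to it."""
    version = get_api_version(service)
    if version not in SUPPORTED:
        raise RuntimeError("unsupported API version: %d" % version)
    return version


print(pick_protocol(OldServer()), pick_protocol(NewServer()))
```

With this shape a client can ship support for both versions ahead of time and switch over as soon as the server starts answering with 2.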
Dennis
On Wed, Jun 04, 2014 at 05:27:44PM -0500, Dennis Gilmore wrote:
> I have a suggestion: make everything provide and validate API info.
> This is something koji does. Though to date we have not broken the API, we expect we will when we start on koji-2.0. Each app would provide a getAPIVersion() function, and the consumers would then validate that they know how to talk to that API version.
> For bodhi, for instance, apps could check for the API version and, if the check fails, assume version 1. Clients would then be able to support both version 1 and 2, and when we deploy bodhi2 live, clients like fedpkg update will transparently switch over.
<nod> This would be nice. I don't think it would have helped us with the pkgdb2 update where people didn't port at all but if we gave more information it could. Ideally we'd want to see something like this:
    >>> data = service.requestAPIVersion([(1, 2), (2, 0)],
    ...         'arbitrary-identification-string')
    >>> print(data)
    [[1, 2], [1, 4]]
In that example the client would be claiming compatibility with API version 1.2 or later and 2.0 or later.
The server would be telling the client that it can supply something compatible with 1.2, namely 1.4.
If we got clients to update their code to make requestAPIVersion() calls both the client and the server would be able to track what the other wanted and was capable of. Serverside, we could keep track of what identification strings were making requests for something other than API version 2.0 and help those people port to 2.0 before deployment.
Is that sort of API too confusing (because it's attempting to track information in both directions but the person using the code really just wants information in a single direction)?
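To make the two-way exchange concrete, here is a rough server-side sketch. Everything in it is an assumption for illustration: the matching rule (same major number, equal or higher minor), the `SERVER_VERSIONS` tuple, the snake_case function name and the use of logging to record what each client asked for.

```python
import logging

log = logging.getLogger("api-negotiation")

# Versions this hypothetical server can supply, as (major, minor).
SERVER_VERSIONS = ((1, 4),)


def request_api_version(wanted, client_id, available=SERVER_VERSIONS):
    """Return [requested, supplied] for the first request we can satisfy.

    A request (major, minor) is satisfied by a server version with the
    same major number and an equal or higher minor number. Raises
    LookupError when nothing matches.
    """
    # Record what each client asked for so that clients still asking
    # for an old version can be found and helped before a deployment.
    log.info("client %r wants %r", client_id, wanted)
    for major, minor in wanted:
        matches = [v for v in available if v[0] == major and v[1] >= minor]
        if matches:
            return [[major, minor], list(max(matches))]
    raise LookupError("no compatible API version for %r" % (wanted,))


data = request_api_version([(1, 2), (2, 0)], "arbitrary-identification-string")
print(data)
```

The client only has to read the second element of the result; the identification string and the list of wanted versions exist purely for the server's bookkeeping.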
-Toshio
On Wed, Jun 04, 2014 at 05:27:44PM -0500, Dennis Gilmore wrote:
> This is something koji does. Though to date we have not broken the API, we expect we will when we start on koji-2.0. Each app would provide a getAPIVersion() function, and the consumers would then validate that they know how to talk to that API version.
> For bodhi, for instance, apps could check for the API version and, if the check fails, assume version 1. Clients would then be able to support both version 1 and 2, and when we deploy bodhi2 live, clients like fedpkg update will transparently switch over.
After the call for testers in November, that was one piece of feedback I received on the API, so it is there: https://admin.fedoraproject.org/pkgdb/api/version Note that this is the API version, which may be different from the Pkgdb2 version. With the decoupling of API and UI, I keep two different version numbers, and as long as the first number is the same, the API is backward compatible :)
I used it just yesterday in pkgdb-cli to call a new API method or fall back on the old one, depending on the version of the pkgdb API: https://github.com/fedora-infra/packagedb-cli/commit/af10afd22a5f4370285ee84...
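The "same first number" rule could be checked along these lines. This is a sketch and not the pkgdb-cli code: it assumes the version is available as a "major.minor" string, adds the assumption that a method introduced in a given minor release needs a server at least that new, and uses the made-up value "1.1" for when the new method appeared.

```python
def parse_version(text):
    """Turn a 'major.minor' string into a tuple of ints."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def is_compatible(client_version, server_version):
    """Same first number means the API is backward compatible.

    Added assumption: the server must also be at least as new as the
    minor version the client was written against.
    """
    client = parse_version(client_version)
    server = parse_version(server_version)
    return client[0] == server[0] and server[1] >= client[1]


def choose_method(server_version, new_method_since="1.1"):
    """Use a newer API method when available, else fall back."""
    if is_compatible(new_method_since, server_version):
        return "new"
    return "old"


print(choose_method("1.0"), choose_method("1.2"))
```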
Pierre
On Wed, 4 Jun 2014 11:44:54 -0700 Toshio Kuratomi a.badger@gmail.com wrote:
> This came up in a different venue and pingou and I have continued to talk about it. Seemed that this was the right place to bring the discussion though.
...snip...
> Some ideas for doing major deployments in the future:
> 1: We have to make people aware when a new deployment means API breaks.
> - Be clear that the new deployment means API breaks in every call for testing. Send announcements to infrastructure list and depending on the service to devel list.
> - Have a separate announcement besides the standard outage notification that says that an API breaking update is planned for $date
> - When we set a date for the new deployment, discuss it at least once in a weekly infrastructure meeting.
> - See also the solution in #3 below
This seems good to me.
> 2: It would be really nice for people to do more testing in stg.
> - Increase rube coverage. rube does end-to-end testing so it's better at catching cross-app issues where APIs change than unittests, which try to be small and self-contained
>   - A flock session where everyone/dev in infra gets to write one rube test so we get to know the framework
Yeah, that sounds good. Perhaps a badge for 'added a rube test' ? :)
> - Run rube daily
>   - Could we run rube in an Xvfb on an infrastructure host?
> - Continue to work towards a complete replica of production in the stg environment.
Yeah, if we can figure out a clean way to run it. It also needs some credentials for some of the tests, so we would need to make a test user, etc.
> 3: "Mean time to repair is more important than mean time between failure." It seems like anytime there's a major update there's unexpected things that break. Let's anticipate the unexpected happening.
I agree here too...
> - Explicitly plan for everyone to spend their day firefighting when we make a major new deployment. If you've already found all the places your code is affected and pre-ported it and the deployment goes smoothly then hey, you've got 6 extra working hours to shift back to doing other things. If it's not smooth, then we've planned to have the attention of the right people for the unexpected difficulties that arise.
Yep.
> - As part of this, we need to identify people outside of infrastructure that should also be ready for breakage. Reach out to rel-eng, docs, qa, cvsadmins, etc if there's a chance that they will be affected.
Agreed.
...snip...
> What should we apply this to?
> - Probably can skip if:
>   - Things that we don't think have API breaks
>   - Things that are minor releases (hopefully these would correlate with not having API breaks :-)
>   - Leaf services that are not essential to releasing Fedora.
>     - ask, nuancier, elections, easyfix, badges, paste
>     - There's a lot of borderline cases too -- is fedocal essential enough to warrant being under this policy? Since the wiki is used via its API should that fall under this as well?
yeah, there's going to be some fuzz, but discussing the update in at least one infra meeting before pushing would mean we would have time/people to hash out if it's a major update or what.
> Comments, thoughts, other ideas?
> Do we need to "ratify" something like this at a meeting?
We could, but I think it's all quite sensible, so we could just do it unless someone objects.
> What's the next app deploy where we'll want to enact this? Maybe bodhi2 ;-)?
Yep. Although it might be something before then... We talked about trying to get bodhi2 in stg before too long, but we don't want to disrupt the release process for f21 too much, so we thought targeting landing it a few weeks after f21 might be best. That also gives lots of time for testing in stg and getting API users switched over.
kevin
infrastructure@lists.fedoraproject.org