Hi. I've written up a proposal for a way to support EPEL builds in Koji. It's not the only way we could do this, but I think it's doable with a reasonable amount of effort, and has the side-effect of greatly simplifying the Koji setup process for a lot of people (by removing the need to bootstrap/import an entire distro of packages into your private Koji instance). You can view the proposal here:
http://fedoraproject.org/wiki/Koji/EPELSupport
It's fairly detailed regarding the data model changes necessary, so if you're not familiar with the Koji codebase you can skip those parts. Questions and comments welcome.
Thanks, Mike
Mike Bonnet wrote:
Hi. I've written up a proposal for a way to support EPEL builds in Koji. It's not the only way we could do this, but I think it's doable with a reasonable amount of effort, and has the side-effect of greatly simplifying the Koji setup process for a lot of people (by removing the need to bootstrap/import an entire distro of packages into your private Koji instance). You can view the proposal here:
http://fedoraproject.org/wiki/Koji/EPELSupport
It's fairly detailed regarding the data model changes necessary, so if you're not familiar with the Koji codebase you can skip those parts. Questions and comments welcome.
Hi Mike,
Good to see you've spent some time on this whereas I have been lazy in Littleton (holiday).
I'd like to share a few thoughts on the Wiki page, which is a great start:
From the Wiki page: "There is a strong feeling that if a package exists in the Koji-managed local repo (whose contents the Koji admin has full control over) it should always be preferred over the external repo (whose contents the Koji admin may have little or no control over)."
The preference koji will have (in deciding which package to use in the buildroot) might introduce the problem where a custom-built package foo-1.0 is used in the buildroot, and upstream updates to foo-1.1 - the running nodes would update to foo-1.1 whereas the buildroot still uses the custom foo-1.0...
The point being that these updates have to be managed as they are released. The updates need to be managed on the side where said packages are being mashed into a repository (infra side) or applied (client side).
You can see the duplicate effort when the updates are managed on either side (infra or client), _and_ in koji, separately.
I would like to suggest the koji development team make the priority setting koji is going to use a configurable item - which, compared to the bigger picture, isn't all that high a priority; just something to think about.
Additionally, I'd like to comment on and ask about the proposed database changes for the tag_config table. In an attempt to show you what I was thinking, here are a number of questions:
From the Wiki page: "At repo creation time, the repodata will be retrieved from the processed url and merged with the local repodata as described above. This single repo will then be used for subsequent builds against the tag"
Do I understand correctly that one can only give a single repository URL to a certain tag? Does this mean that a tag is created for (for example) "dist-el5" with a remote repository URL, and then "dist-el5-updates" with another remote repository URL? That would mean the build target used would have dist-el5-updates inherit dist-el5, right? Which then implies that either metadata needs to be imported for dist-el5-updates, or inheritance can only be applied at build time... right?
The question, I guess, is basically: how does koji handle tags with a combination of remote urls & inheritance?
From the Wiki page: "Right now that (rpminfo) table enforces uniqueness of (name, version, release, arch)."
I see that koji does not store the complete package nevra, which may become a problem if duplicate nvras occur. That is quite likely where rebuilding a package with the release number bumped collides with upstream doing a release bump of its own - which is where the epoch is often used, though upstream has clear guidelines for epoch bumps which (hopefully) make them occur in special circumstances only and thus greatly reduce the chance of a colliding nevra. I like the proposed uniqueness of NVRA namespaces as well, don't get me wrong ;-)
The other thing (and probably the last thing for now) I'd like to ask is: for reproducibility purposes, how viable would it be to have koji automatically import the remote RPM (the file and all the data) as it is used from the remote repository? This may or may not be a configurable option. It saves work for admins compared to the current situation, and preserves reproducibility under all circumstances: the automatically imported RPM would be added to the appropriate tags and stored for reproducibility, whereas upstream only keeps two versions in the repository... Though I understand it 1) consumes space and 2) isn't helpful for the EPEL case, I think this is particularly useful for long-term supported appliance software. Just wondering here ;-)
Let me know what you think,
Kind regards,
Jeroen van Meeuwen -kanarip
Jeroen van Meeuwen wrote:
I'd like to share a few thoughts on the Wiki page, which is a great start:
(...)
Did I mention that my primary concern with the aforementioned questions relates more to "make-your-own" private koji instances than to the one that is going to build EPEL?
Sorry for any confusion.
Kind regards,
Jeroen van Meeuwen -kanarip
On Thu, 2008-07-10 at 19:12 +0200, Jeroen van Meeuwen wrote:
Mike Bonnet wrote:
Hi. I've written up a proposal for a way to support EPEL builds in Koji. It's not the only way we could do this, but I think it's doable with a reasonable amount of effort, and has the side-effect of greatly simplifying the Koji setup process for a lot of people (by removing the need to bootstrap/import an entire distro of packages into your private Koji instance). You can view the proposal here:
http://fedoraproject.org/wiki/Koji/EPELSupport
It's fairly detailed regarding the data model changes necessary, so if you're not familiar with the Koji codebase you can skip those parts. Questions and comments welcome.
Hi Mike,
Good to see you've spent some time on this whereas I have been lazy in Littleton (holiday).
I'd like to share a few thoughts on the Wiki page, which is a great start:
From the Wiki page: "There is a strong feeling that if a package exists in the Koji-managed local repo (whose contents the Koji admin has full control over) it should always be preferred over the external repo (whose contents the Koji admin may have little or no control over)."
The preference koji will have (in deciding which package to use in the buildroot) might introduce the problem where a custom-built package foo-1.0 is used in the buildroot, and upstream updates to foo-1.1 - the running nodes would update to foo-1.1 whereas the buildroot still uses the custom foo-1.0...
Yes, it's up to the Koji admin to monitor the remote repo, and take appropriate action when their custom local packages are superseded by packages in the remote repo. That may be untagging or blocking the package locally so the newer version can be pulled down from the remote repo. Or it may be rebuilding the custom package based on the updated sources. The point is that the build environment doesn't change unless the Koji admin takes some action to change it.
The point being that these updates have to be managed as they are released. The updates need to be managed on the side where said packages are being mashed into a repository (infra side) or applied (client side).
You can see the duplicate effort when the updates are managed on either side (infra or client), _and_ in koji, separately.
There is duplicate effort either way. The difference is that, if highest-nvr-wins is used, and a remote repo updates to a later version of a package that you have a custom build of, there is *no way* for you to revert your build environment to that lower-nvr version without bumping your version higher than their version (without actually changing the source at all) and rebuilding. It encourages this Cold War arms-race of version numbers between your custom packages and the remote repo's packages, and results in the admin having to fake higher version numbers and rebuild constantly *without any source changes* just to keep their custom packages in their build environment.
Alternately, if first-match-wins is used (where the first repo is the locally-managed Koji repo), and a remote repo updates to a later version of a package you have a custom version of, nothing happens to your build environment. If you decide you want the newer version from the remote repo, you untag your local package and let it get pulled in from the remote repo. If that newer version has problems, retag your custom version and it will then be available in the build environment again. There is no unnecessary building of packages, no faking version numbers, and no unexpected changes to your build environment. It's the "principle of least surprise", which is why I think it's the right policy to use in a managed build environment like Koji.
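To make the policy concrete, here is a rough sketch of first-match-wins selection in python. It's purely illustrative - the actual filtering would be done by the repomerge tool, not by Koji, and the data structures here are made up:

def first_match_wins(ordered_repos, blocked=()):
    """ordered_repos: the Koji-managed repo first, then external repos in
    order.  Each repo is a dict mapping package name -> package metadata."""
    merged = {}
    for repo in ordered_repos:
        for name, pkg in repo.items():
            if name in blocked:
                continue                # blocked packages never show up
            if name not in merged:      # earlier repos always win
                merged[name] = pkg
    return merged

# the custom foo-1.0 stays in the build environment even though the external
# repo has moved on to foo-1.1; bar is pulled in from the external repo
local = {'foo': 'foo-1.0 (custom build)'}
external = {'foo': 'foo-1.1 (remote update)', 'bar': 'bar-2.3'}
merged = first_match_wins([local, external])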
I would like to suggest the koji development team make the priority setting koji is going to use a configurable item - which, compared to the bigger picture, isn't all that high a priority; just something to think about.
I strongly feel that this isn't something that needs to be configurable, and that first-match-wins is the correct behavior. But if other people agree that there is a valid use-case for making it configurable, and Seth and/or James can make the logic in repomerge configurable, then we can add a switch for it to Koji.
Additionally, I'd like to comment on and ask about the proposed database changes for the tag_config table. In an attempt to show you what I was thinking, here are a number of questions:
From the Wiki page: "At repo creation time, the repodata will be retrieved from the processed url and merged with the local repodata as described above. This single repo will then be used for subsequent builds against the tag"
Do I understand correctly that one can only give a single repository URL to a certain tag? Does this mean that a tag is created for (for example) "dist-el5" with a remote repository URL, and then "dist-el5-updates" with another remote repository URL? That would mean the build target used would have dist-el5-updates inherit dist-el5, right? Which then implies that either metadata needs to be imported for dist-el5-updates, or inheritance can only be applied at build time... right?
The question, I guess, is basically: how does koji handle tags with a combination of remote urls & inheritance?
Originally you were correct, the proposal only allowed for a single remote repo to be configured. This was mandated by the desire to track packages back to their repository of origin, and the lack of repository data in the rpmdb. jkeating convinced me that this wasn't a very useful implementation, and suggested that we could get information about the origin of a given rpm from the baseurl in the repodata.
I've updated the wiki page with a new implementation proposal that will allow for multiple remote repos while still tracking package origin, and specifies how remote repos will interact with the tag inheritance tree. Please take a look and let me know what you think.
From the Wiki page: "Right now that (rpminfo) table enforces uniqueness of (name, version, release, arch)."
I see that koji does not store the complete package nevra, which may become a problem if duplicate nvras occur. That is quite likely where rebuilding a package with the release number bumped collides with upstream doing a release bump of its own - which is where the epoch is often used, though upstream has clear guidelines for epoch bumps which (hopefully) make them occur in special circumstances only and thus greatly reduce the chance of a colliding nevra. I like the proposed uniqueness of NVRA namespaces as well, don't get me wrong ;-)
Koji intentionally ignores epoch when enforcing uniqueness. For better or worse, the epoch is mostly hidden from users, and does not show up in the filename. Having packages with the same NVRA but different epochs was considered harmful when Koji was being designed, and Koji will prevent this from happening. Note that Koji does *store* the epoch, it just doesn't use it when enforcing uniqueness.
In the proposal, local packages exist in one NVRA namespace, and each remote repo (differentiated by URL) exists in a different NVRA namespace. So NVRA must be unique within each repo (local or remote) but not across repos. So NVRA collisions between your local Koji instance and a remote repo will not cause problems at the data model level. Which package gets selected and made available in the buildroots will be handled by the (possibly configurable) package selection policy of createrepo/mergerepo.
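In other words, the uniqueness key effectively gains a namespace component. Roughly (illustrative only, not the actual schema or code):

# None is the local namespace; each remote repo URL is its own namespace
def nvra_key(name, version, release, arch, external_repo=None):
    return (external_repo, name, version, release, arch)

seen = set()
def check_unique(key):
    if key in seen:
        raise ValueError("duplicate NVRA within one namespace: %r" % (key,))
    seen.add(key)

# the same NVRA locally and in a remote repo is fine - different namespaces
check_unique(nvra_key('foo', '1.0', '1', 'x86_64'))
check_unique(nvra_key('foo', '1.0', '1', 'x86_64',
                      'http://example.com/epel/5/x86_64/'))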
The other thing (and probably the last thing for now) I'd like to ask is: for reproducibility purposes, how viable would it be to have koji automatically import the remote RPM (the file and all the data) as it is used from the remote repository? This may or may not be a configurable option. It saves work for admins compared to the current situation, and preserves reproducibility under all circumstances: the automatically imported RPM would be added to the appropriate tags and stored for reproducibility, whereas upstream only keeps two versions in the repository... Though I understand it 1) consumes space and 2) isn't helpful for the EPEL case, I think this is particularly useful for long-term supported appliance software. Just wondering here ;-)
This sounds much more like the secondary-arch approach, and is separate from what we're trying to accomplish here. I had requested that the secondary-arch daemon support a "same-arch-downstream" mode where it would download and import (rather than rebuild) builds from an upstream Koji as they were completed. However, this is a lot more complicated and requires more detailed policy. If this is a requirement for you, I suggest you take a look at the secondary-arch work.
Mike Bonnet wrote:
This is mostly in line with what I've been thinking. I do have a few comments/concerns though...
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table. I'd like to reserve tag_config for data that is local to individual tags. This will also make it easier to represent multiple remote repos.
I'm a little concerned about using the rpminfo table. Yes, I know it seems wasteful to introduce another table to track very similar data, but these remote rpms really are differently tracked and handled than the local ones.
Also, I'm not sure how I feel about having rpminfo entries with null build_id. Sure, technically the field lacks the 'not null' constraint, but that is more of an oversight.
Note, I'm not outright rejecting the idea of using rpminfo this way, but I am concerned.
As for the origin field. I think we should track where these external rpms come from, but I'm not sure about including in the uniqueness constraint. I'm not sure that the value of that field is sufficiently well defined (or canonicalizable) for such use. I'd rather see the sigmd5 value (or some abstracting sighash field) used as a unique index.
Following are additional ideas relating to this feature. They are perhaps a bit ambitious for the short term, but I'd at least like to keep them in mind with the initial design so we don't paint ourselves into a corner.
First, I'd like to be able to support external koji servers (or rather a target or tag from an external koji server) in addition to external repos. Some of the ideas are the same, however an external koji server provides more information and more structure.
Second, I'm fond of having a tag /represent/ some external repo/whatever and having the normal inheritance mechanism take care of priority. The trick here is that Koji tag content is by build, but it will be tricky to correctly determine build structure for external rpms -- indeed, external repos might include subpackages from different versions of the same build (an external koji server would not, at least for its local content). So this will probably be difficult, but if we could manage something like this, I'd feel a lot better about using the rpminfo table.
Doing something like this would most likely require Koji to comprehend the external repos instead of just passing them off to a repomerge tool.
Third, we may not want to use a repomerge tool. The yum-priorities plugin might serve just as well, and allow us to specify some different yum repo options per external repo. This may conflict with idea#2 though.
On Thursday 17 July 2008, Mike McLean wrote:
Mike Bonnet wrote:
This is mostly in line with what I've been thinking. I do have a few comments/concerns though...
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table. I'd like to reserve tag_config for data that is local to individual tags. This will also make it easier to represent multiple remote repos.
I'm a little concerned about using the rpminfo table. Yes, I know it seems wasteful to introduce another table to track very similar data, but these remote rpms really are differently tracked and handled than the local ones.
Also, I'm not sure how I feel about having rpminfo entries with null build_id. Sure, technically the field lacks the 'not null' constraint, but that is more of an oversight.
Note, I'm not outright rejecting the idea of using rpminfo this way, but I am concerned.
As for the origin field. I think we should track where these external rpms come from, but I'm not sure about including in the uniqueness constraint. I'm not sure that the value of that field is sufficiently well defined (or canonicalizable) for such use. I'd rather see the sigmd5 value (or some abstracting sighash field) used as a unique index.
Following are additional ideas relating to this feature. They are perhaps a bit ambitious for the short term, but I'd at least like to keep them in mind with the initial design so we don't paint ourselves into a corner.
First, I'd like to be able to support external koji servers (or rather a target or tag from an external koji server) in addition to external repos. Some of the ideas are the same, however an external koji server provides more information and more structure.
In addition to external koji servers, I'd like to support spacewalk servers, and have the ability to push builds back into channels on spacewalk servers. Ideally the spacewalk server knows how to pull from a koji server rather than duplicating data by importing directly. This way an organisation could build upon Fedora/RHEL/CentOS for their own needs, but can also have an easier time doing rel-eng on them.
Second, I'm fond of having a tag /represent/ some external repo/whatever and having the normal inheritance mechanism take care of priority. The trick here is that Koji tag content is by build, but it will be tricky to correctly determine build structure for external rpms -- indeed, external repos might include subpackages from different versions of the same build (an external koji server would not, at least for its local content). So this will probably be difficult, but if we could manage something like this, I'd feel a lot better about using the rpminfo table.
I would think there should be a 1-to-1 mapping of tag to external repo, using normal inheritance.
Doing something like this would most likely require Koji to comprehend the external repos instead of just passing them off to a repomerge tool.
Third, we may not want to use a repomerge tool. The yum-priorities plugin might serve just as well, and allow us to specify some different yum repo options per external repo. This may conflict with idea#2 though.
I can see a case where this won't work. I have a local tag built on top of F-8, and I want it lower than the remote F-9 because some of what I need is now in Fedora, but I need other bits from my tag to be inherited so that I can bootstrap things to the F-9 level. Maybe we would produce 2 local repos and use yum priorities to fit them together. Maybe this case is rare enough not to bother with, but it could be an idea to keep in mind.
On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
Mike Bonnet wrote:
This is mostly in line with what I've been thinking. I do have a few comments/concerns though...
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table. I'd like to reserve tag_config for data that is local to individual tags. This will also make it easier to represent multiple remote repos.
I don't have any problem with this, though it does mean we'll need to duplicate quite a bit of the inheritance-walking code, or make it configurable as to which inheritance it's walking. This new table would also have to be versioned, the same way the tag_config table is.
I'm a little concerned about using the rpminfo table. Yes, I know it seems wasteful to introduce another table to track very similar data, but these remote rpms really are differently tracked and handled than the local ones.
The big win here is that the methods and tools that query rpminfo for information about what was present in the buildroot at build time wouldn't have to change, or only change slightly. With minor modification the web UI can continue to show a list of all packages in a buildroot, along with a flag indicating if they were local or remote. The buildroot_listing table would not have to change at all. The majority of XML-RPC calls that interact with the rpminfo or buildroot_listing tables would only need minor modifications. Adding a new table to track remote rpm metadata and which remote rpms end up in a buildroot would add significant effort to this proposal. Also, I think it's more semantically correct to have a single place where we track rpm metadata and buildroot contents, regardless of where they came from.
Also, I'm not sure how I feel about having rpminfo entries with null build_id. Sure, technically the field lacks the 'not null' constraint, but that is more of an oversight.
Yes, I realize that the "not null" constraint should exist now, and in fact all rpms in the Fedora database do reference builds. However, I think logically having a remote rpm not reference a local build makes sense. The alternative is to create the build object from the srpm info in the repodata (along with some namespacing similar to rpminfo). However, this would significantly clutter the build table with information that is pretty non-essential.
Note, I'm not outright rejecting the idea of using rpminfo this way, but I am concerned.
As for the origin field. I think we should track where these external rpms come from, but I'm not sure about including in the uniqueness constraint. I'm not sure that the value of that field is sufficiently well defined (or canonicalizable) for such use. I'd rather see the sigmd5 value (or some abstracting sighash field) used as a unique index.
I'm open to suggestions on how to modify the uniqueness constraint to handle this case. We care about ensuring that a locally-built rpm doesn't have the same n-v-r as another locally-built rpm. I don't think we care at all about n-v-r uniqueness amongst remote rpms. However, we probably want to avoid creating 2 rpminfo entries when the same remote rpm is used in 2 different buildroots. Using the sigmd5 is a good way to avoid that. However, what happens if a remote rpm with the same n-v-r and sigmd5 gets pulled in from 2 different remote repos? Perhaps the "origin" field should be pushed down to the buildroot_listing table, so the buildroots can reference the same rpminfo object, but indicate that it came from a different repo in each buildroot?
Also, what happens when we find 2 remote rpms with the same n-v-r but different sigmd5s? Should that be an error?
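To illustrate what I mean about keying on sigmd5 and pushing the origin down to buildroot_listing, something along these lines (the db layer and the column names here are just placeholders, not the actual schema):

def record_external_rpm(db, hdr, external_repo_id, buildroot_id):
    # look the rpm up by sigmd5, so the same remote rpm used in two
    # buildroots (or seen in two repos) only ever gets one rpminfo row
    rpm = db.lookup_rpm(sigmd5=hdr['sigmd5'])
    if rpm is None:
        # build_id stays NULL for external rpms
        rpm = db.insert_rpm(name=hdr['name'], version=hdr['version'],
                            release=hdr['release'], arch=hdr['arch'],
                            epoch=hdr.get('epoch'), sigmd5=hdr['sigmd5'],
                            build_id=None)
    # the origin lives on the buildroot_listing row, so the same rpminfo
    # entry can come from a different repo in each buildroot
    db.insert_buildroot_listing(buildroot_id=buildroot_id,
                                rpm_id=rpm['id'],
                                external_repo_id=external_repo_id)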
Following are additional ideas relating to this feature. They are perhaps a bit ambitious for the short term, but I'd at least like to keep them in mind with the initial design so we don't paint ourselves into a corner.
First, I'd like to be able to support external koji servers (or rather a target or tag from an external koji server) in addition to external repos. Some of the ideas are the same, however an external koji server provides more information and more structure.
I agree that this is a desirable goal. I believe this is more the domain of the Koji secondary-arch daemon. It would be talking directly to an "upstream" Koji server, analyzing what it's doing, and applying some logic to decide what builds to import or replicate, and where/how to do it. This proposal has the much more modest goal of simply consuming static external repos, and is more appropriate for the EPEL and private-standalone-Koji case.
Second, I'm fond of having a tag /represent/ some external repo/whatever and having the normal inheritance mechanism take care of priority. The trick here is that Koji tag content is by build, but it will be tricky to correctly determine build structure for external rpms -- indeed, external repos might include subpackages from different versions of the same build (an external koji server would not, at least for its local content). So this will probably be difficult, but if we could manage something like this, I'd feel a lot better about using the rpminfo table.
Doing something like this would most likely require Koji to comprehend the external repos instead of just passing them off to a repomerge tool.
The tag content may be managed by build, but when it's time for it to actually get used (in the form of a yum repo) it gets unfolded into a big list of rpms. And what gets associated with a buildroot is simply a big list of rpms. Conceptually I don't really have a problem with the idea of a tag as a big list of rpms, that we happen to group by srpm within Koji because it's more convenient for us. So adding the external repo information to tag_config is just an extension of the big list of rpms model.
However, we will already be parsing the remote repodata, which contains information like the srpm name for each rpm, so we could do something more sophisticated here.
Third, we may not want to use a repomerge tool. The yum-priorities plugin might serve just as well, and allow us to specify some different yum repo options per external repo. This may conflict with idea#2 though.
This was my first thought as well. However, after discussions with Jesse, Seth, and James I was convinced otherwise. The yum-priorities plugin seems very unpopular with yum developers (not quite sure why). I don't think yum-priorities would give us any way to completely block a package from local and remote repos, and configuring multiple repos in the mock config would require Koji to retrieve and parse each remote repodata to determine the origin of a given remote rpm.
The repomerge tool seems like it solves the problem better, and would be more useful in general.
On Thu, 2008-07-17 at 18:48 -0400, Mike Bonnet wrote:
This was my first thought as well. However, after discussions with Jesse, Seth, and James I was convinced otherwise. The yum-priorities plugin seems very unpopular with yum developers (not quite sure why). I don't think yum-priorities would give us any way to completely block a package from local and remote repos, and configuring multiple repos in the mock config would require Koji to retrieve and parse each remote repodata to determine the origin of a given remote rpm.
Also you wouldn't be able to prioritize at the srpm level which is what we want (no unwanted subpackages sneaking in).
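For example, something like this is what "prioritize at the srpm level" boils down to (a rough sketch; the sourcerpm parsing is simplified):

def filter_external_by_srpm(external_pkgs, local_srpm_names):
    """external_pkgs: dicts with 'name' and 'sourcerpm' (e.g. foo-1.0-1.src.rpm)."""
    kept = []
    for pkg in external_pkgs:
        # strip '-version-release.src.rpm' to get the srpm name
        srpm_name = pkg['sourcerpm'].rsplit('-', 2)[0]
        if srpm_name in local_srpm_names:
            # a local build of this srpm exists, so all of its subpackages
            # from the external repo are dropped, not just the ones the
            # local build happens to produce
            continue
        kept.append(pkg)
    return kept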
Mike Bonnet wrote:
On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table. I'd like to reserve tag_config for data that is local to individual tags. This will also make it easier to represent multiple remote repos.
I don't have any problem with this, though it does mean we'll need to duplicate quite a bit of the inheritance-walking code, or make it configurable as to which inheritance it's walking. This new table would also have to be versioned, the same way the tag_config table is.
Walking inheritance is just a matter of determining the inheritance order and scanning data on the parent tags in sequence. Currently, nothing scans tag_config in this way because no data in tag_config is inherited. (Well, in a sense tag_changed_since_event() does walk tag_config, but that's a little different.)
We need to figure out how we'll deal with multiplicity for the external repos. If tag A uses repo X and inherits from tag B which uses repo Y, then does tag A use both X and Y, or does the X entry override it?

    A (+repo X)
     +- B (+repo Y)
My inclination is that it should override, because I think we'll want some way to do overrides, and that mechanism seems easiest.
Also, I think we'll probably want to allow multiple external repos per tag, something which will be much easier to represent in an external table. We can include an explicit priority field to make a sane uniqueness condition (and to provide a clear ordering for the repo merge).
The big win here is that the methods and tools that query rpminfo for information about what was present in the buildroot at build time
-snip-
I see all that, and I'm almost convinced. The flipside is that by default all the code will treat these external rpms the same as the local ones, which will not be correct for a number of cases. Obviously, part of this will involve changing code to behave differently for the external ones, I'm just worried about how much we might have to change, or what we might miss.
Yes, I realize that the "not null" constraint should exist now, and in fact all rpms in the Fedora database do reference builds. However, I think logically having a remote rpm not reference a local build makes sense. The alternative is to create the build object from the srpm info in the repodata (along with some namespacing similar to rpminfo). However, this would significantly clutter the build table with information that is pretty non-essential.
The idea of grouping them into builds appeals to me, but I don't think it's possible in general (though maybe we could fake it well enough somehow). The only data we're (mostly) guaranteed to have to work with is the sourcerpm header field. The catch is that in case of an nvr-collision we can't determine which build it belongs to (or indeed if we should create a new build of same nvr).
I'm open to suggestions on how to modify the uniqueness constraint to handle this case. We care about ensuring that a locally-built rpm doesn't have the same n-v-r as another locally-built rpm. I don't think we care at all about n-v-r uniqueness amongst remote rpms. However, we probably want to avoid creating 2 rpminfo entries when the same remote rpm is used in 2 different buildroots. Using the sigmd5 is a good way to avoid that.
Agreed. same sigmd5 ==> same rpm.
However, what happens if a remote rpm with the same n-v-r and sigmd5 gets pulled in from 2 different remote repos?
This gets into part of what bugs me about this and why I'm somewhat inclined to keep the ext repo data a step removed. It's so potentially dirty. Koji has all these consistency constraints that an external repo (much less many of them in aggregate) lacks.
It's quite possible that an external repo might respin a package keeping the same nvr, so we don't even need 2 external repos to hit this possibility.
Perhaps the "origin" field should be pushed down to the buildroot_listing table, so the buildroots can reference the same rpminfo object, but indicate that it came from a different repo in each buildroot?
Interesting. Yeah, I think that is probably the right answer.
Also, I'm thinking we need to have some sort of rpm_origin table so that all these references can be managed cleanly.
Also, what happens when we find 2 remote rpms with the same n-v-r but different sigmd5s? Should that be an error?
Certainly we have to allow the possibility that two origins might have overlapping nvras. Within a single origin, I'm not so sure. I suppose we can get away with some small consistency demands. As long as we're only enforcing unique nvra for local builds and indexing by sigmd5/similar, I don't think we /have/ to make this an error condition.
In the same vein, what happens when an external repo has an nvra+sigmd5 matching a /local/ rpm? Maybe it doesn't matter, though I guess technically we want to record the origin properly when it gets into a buildroot via external repo vs internal tag.
First, I'd like to be able to support external koji servers (or rather a
...
I agree that this is a desirable goal. I believe this is more the domain of the Koji secondary-arch daemon. It would be talking directly
Well, it has some similarities to 2nd arch, but still quite different.
The more I think about it, the more I think that supporting an external koji server will probably be much different from the ext repo business. Most of the issues with rpminfo will carry over, but with a koji server we will be able to determine build data and can probably actually pull off something like "inherit from tag X on koji server Y."
The tag content may be managed by build, but when it's time for it to actually get used (in the form of a yum repo) it gets unfolded into a big list of rpms. And what gets associated with a buildroot is simply a big list of rpms. Conceptually I don't really have a problem with the idea of a tag as a big list of rpms, that we happen to group by srpm within Koji because it's more convenient for us. So adding the external repo information to tag_config is just an extension of the big list of rpms model.
Yeah, I almost wish I hadn't made the build structure quite the way I did.
However, we will already be parsing the remote repodata, which contains information like the srpm name for each rpm, so we could do something more sophisticated here.
-snipsnip- ...
The repomerge tool seems like it solves the problem better, and would be more useful in general.
If we're going to have our fingers in the repodata, we'll probably want to have them in the merge too. Perhaps we can get createrepo and/or this repomerge tool usefully libified?
On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
Mike Bonnet wrote:
On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table. I'd like to reserve tag_config for data that is local to individual tags. This will also make it easier to represent multiple remote repos.
I don't have any problem with this, though it does mean we'll need to duplicate quite a bit of the inheritance-walking code, or make it configurable as to which inheritance it's walking. This new table would also have to be versioned, the same way the tag_config table is.
Walking inheritance is just a matter of determining the inheritance order and scanning data on the parent tags in sequence. Currently, nothing scans tag_config in this way because no data in tag_config is inherited. (Well, in a sense tag_changed_since_event() does walk tag_config, but that's a little different.)
Sorry, I was referring to walking tag_inheritance. I'd rather have one place that walks the inheritance hierarchy and aggregates data from it, than two places that are doing almost the same thing.
Each tag has a set of builds associated with it. We walk the inheritance hierarchy, aggregating the builds from each tag in the hierarchy into a flat list, and then pass that list to createrepo. We would do essentially the same thing for external repos. When walking the hierarchy, if a tag has an external repo associated with it, we would append that repo url to a flat list, and pass that list to mergerepo. In both cases we're working with collections of packages that are associated with a tag, just in different formats.
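Sketched out, it would look something like this (readFullInheritance exists today; the per-tag helpers and the dict keys are hypothetical stand-ins for whatever the hub ends up providing):

def repo_inputs(hub, tag_id):
    builds, external_repos, seen = [], [], set()
    # the tag itself first, then its parents in inheritance order
    tags = [tag_id] + [t['parent_id'] for t in hub.readFullInheritance(tag_id)]
    for t in tags:
        for build in hub.listBuildsForTag(t):        # hypothetical per-tag call
            if build['package_name'] not in seen:    # first match wins
                seen.add(build['package_name'])
                builds.append(build)
        url = hub.getExternalRepoUrl(t)              # hypothetical per-tag call
        if url:
            external_repos.append(url)
    # builds go to createrepo, the url list goes to mergerepo
    return builds, external_repos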
We need to figure out how we'll deal with multiplicity for the external repos. If tag A uses repo X and inherits from tag B which uses repo Y, then does tag A use both X and Y, or does the X entry override it?

    A (+repo X)
     +- B (+repo Y)
My inclination is that it should override, because I think we'll want some way to do overrides, and that mechanism seems easiest.
In discussing this with Jesse, I think we want external repos to be inherited. This is probably the easiest way to deal with having multiple external repos getting pulled in to a single buildroot, which is essential for Fedora (think F9 GA and F9 Updates).
The idea was that, by convention, we would have external-repo-only tags, with only a single external repo associated with it and no packages/builds associated. These external-repo-only tags could then be inserted into the build hierarchy where appropriate. An ordered list of external repos could then be constructed by performing the current depth-first search of the inheritance hierarchy. The ordered list would then be passed to mergerepo, which would ensure that packages in repos earlier in the list supersede packages (by srpm name) in repos later in the list. This would preserve the "first-match-wins" inheritance policy that Koji currently implements, and that admins expect. For example:
    dist-custom-build
     ├─dist-custom
     └─dist-f9-updates-external
        └─dist-f9-ga-external
would result in mergerepo creating a single repo that would only contain packages from dist-f9-ga-external if they did not exist in the Koji-generated repo (dist-custom-build + dist-custom), dist-f9-updates-external, or the blacklist of blocked packages. This is consistent with how Koji package inheritance currently works, and I think is the most intuitive approach.
Also, I think we'll probably want to allow multiple external repos per tag, something which will be much easier to represent in an external table. We can include an explicit priority field to make a sane uniqueness condition (and to provide a clear ordering for the repo merge).
As outlined above, I'd prefer to keep it to one external repo per tag, along with repo inheritance. I think this is easier from a management perspective, and more consistent with the way Koji currently works. Ordering for mergerepo will be represented by the location of the tag in the inheritance hierarchy. With a 1-to-1 tag->external repo mapping, it then makes sense to store the external repo url in the tag_config table.
The big win here is that the methods and tools that query rpminfo for information about what was present in the buildroot at build time
-snip-
I see all that, and I'm almost convinced. The flipside is that by default all the code will treat these external rpms the same as the local ones, which will not be correct for a number of cases. Obviously, part of this will involve changing code to behave differently for the external ones, I'm just worried about how much we might have to change, or what we might miss.
Personally I'd prefer adding a few special cases to the existing code, rather than maintain a whole heap of almost-but-not-quite-the-same code to manage external rpms. I think that conceptually they're alike enough that the number of special cases will be minimal.
Yes, I realize that the "not null" constraint should exist now, and in fact all rpms in the Fedora database do reference builds. However, I think logically having a remote rpm not reference a local build makes sense. The alternative is to create the build object from the srpm info in the repodata (along with some namespacing similar to rpminfo). However, this would significantly clutter the build table with information that is pretty non-essential.
The idea of grouping them into builds appeals to me, but I don't think it's possible in general (though maybe we could fake it well enough somehow). The only data we're (mostly) guaranteed to have to work with is the sourcerpm header field. The catch is that in case of an nvr-collision we can't determine which build it belongs to (or indeed if we should create a new build of same nvr).
I think that synthesizing builds for the sake of maintaining the not-null constraint is more pain than it's worth, and would make enforcing our nvr-uniqueness constraints (which we definitely want to do for local builds) more difficult. Having locally-built rpms always associated with a build, and external rpms not, makes sense to me.
I'm open to suggestions on how to modify the uniqueness constraint to handle this case. We care about ensuring that a locally-built rpm doesn't have the same n-v-r as another locally-built rpm. I don't think we care at all about n-v-r uniqueness amongst remote rpms. However, we probably want to avoid creating 2 rpminfo entries when the same remote rpm is used in 2 different buildroots. Using the sigmd5 is a good way to avoid that.
Agreed. same sigmd5 ==> same rpm.
However, what happens if a remote rpm with the same n-v-r and sigmd5 gets pulled in from 2 different remote repos?
This gets into part of what bugs me about this and why I'm somewhat inclined to keep the ext repo data a step removed. It's so potentially dirty. Koji has all these consistency constraints that an external repo (much less many of them in aggregate) lacks.
It's quite possible that an external repo might respin a package keeping the same nvr, so we don't even need 2 external repos to hit this possibility.
Perhaps the "origin" field should be pushed down to the buildroot_listing table, so the buildroots can reference the same rpminfo object, but indicate that it came from a different repo in each buildroot?
Interesting. Yeah, I think that is probably the right answer.
Also, I'm thinking we need to have some sort of rpm_origin table so that all these references can be managed cleanly.
That sounds reasonable to me. Note that we may end up with a lot of rows in this table, since we're allowing variable substitution in the external_repo_url (tag name and arch). But I don't see that as a problem.
Also, what happens when we find 2 remote rpms with the same n-v-r but different sigmd5s? Should that be an error?
Certainly we have to allow the possibility that two origins might have overlapping nvras. Within a single origin, I'm not so sure. I suppose we can get away with some small consistency demands. As long as we're only enforcing unique nvra for local builds and indexing by sigmd5/similar, I don't think we /have/ to make this an error condition.
Yeah, it's probably safest to not make this an error condition, since we have very little control over the remote repos.
In the same vein, what happens when an external repo has an nvra+sigmd5 matching a /local/ rpm? Maybe it doesn't matter, though I guess technically we want to record the origin properly when it gets into a buildroot via external repo vs internal tag.
Right, we would record the origin as the remote repo it came from (by parsing the merged repodata and looking at the baseurl).
First, I'd like to be able to support external koji servers (or rather a
...
I agree that this is a desirable goal. I believe this is more the domain of the Koji secondary-arch daemon. It would be talking directly
Well, it has some similarities to 2nd arch, but still quite different.
The more I think about it, the more I think that supporting an external koji server will probably be much different from the ext repo business. Most of the issues with rpminfo will carry over, but with a koji server we will be able to determine build data and can probably actually pull off something like "inherit from tag X on koji server Y."
And in the external Koji server case, it might actually make sense to create build objects for the external rpms, since we'll be able to query the external Koji about which build an rpm came from.
The tag content may be managed by build, but when it's time for it to actually get used (in the form of a yum repo) it gets unfolded into a big list of rpms. And what gets associated with a buildroot is simply a big list of rpms. Conceptually I don't really have a problem with the idea of a tag as a big list of rpms, that we happen to group by srpm within Koji because it's more convenient for us. So adding the external repo information to tag_config is just an extension of the big list of rpms model.
Yeah, I almost wish I hadn't made the build structure quite the way I did.
However, we will already be parsing the remote repodata, which contains information like the srpm name for each rpm, so we could do something more sophisticated here.
-snipsnip- ...
The repomerge tool seems like it solves the problem better, and would be more useful in general.
If we're going to have our fingers in the repodata, we'll probably want to have them in the merge too. Perhaps we can get createrepo and/or this repomerge tool usefully libified?
I was thinking we would probably just call out to the tool the way we do for createrepo, but I'm certainly not against using an API. I'm a little concerned about memory usage when doing the create/mergerepo in-process, since we know python and mod_python have garbage-collection issues, but that may be a "cross the bridge when we come to it" problem. Seth, is it feasible to provide an API to mergerepo that we could use directly?
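For the call-out-to-the-tool option, I'm imagining something along these lines (the flag names are guesses at what the cleaned-up script will accept, not a documented interface):

import subprocess

def merge_repos(local_repo_dir, external_repo_urls, outdir):
    # repo order matters: the Koji-managed repo first, then the external
    # repos in the order produced by the inheritance walk
    cmd = ['mergerepo', '--outputdir', outdir]
    for repo in [local_repo_dir] + list(external_repo_urls):
        cmd += ['--repo', repo]
    subprocess.check_call(cmd)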
On Wed, 2008-08-13 at 17:35 -0400, Mike Bonnet wrote:
I was thinking we would probably just call out to the tool the way we do for createrepo, but I'm certainly not against using an API. I'm a little concerned about memory usage when doing the create/mergerepo in-process, since we know python and mod_python have garbage-collection issues, but that may be a "cross the bridge when we come to it" problem. Seth, is it feasible to provide an API to mergerepo that we could use directly?
createrepo has an api. repomerge should be relatively easy to use the same way since repomerge is really just a combination script using createrepo and yum's interfaces.
when I have the script cleaned up more I'll make sure you can import it usefully.
-sv
Mike Bonnet wrote:
On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
Mike Bonnet wrote:
On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table.
...
I don't have any problem with this, though it does mean we'll need to duplicate quite a bit of the inheritance-walking code,
...
Walking inheritance is just a matter of determining the inheritance order and scanning data on the parent tags in sequence.
...
Sorry, I was referring to walking tag_inheritance. I'd rather have one place that walks the inheritance hierarchy and aggregates data from it, than two places that are doing almost the same thing.
We're talking about inherently different data. External repos to be merged in are quite different from builds in the system.
Each tag has a set of builds associated with it. We walk the inheritance hierarchy, aggregating the builds from each tag in the hierarchy into a flat list, and then pass that list to createrepo. We would do essentially the same thing for external repos. When walking the hierarchy, if a tag has an external repo associated with it, we would append that repo url to a flat list, and pass that list to mergerepo. In both cases we're working with collections of packages that are associated with a tag, just in different formats.
Sure, we can do this with one call to readFullInheritance, and traverse both the build table and external repo table from the given order.
In discussing this with Jesse, I think we want external repos to be inherited. This is probably the easiest way to deal with having multiple external repos getting pulled in to a single buildroot, which is essential for Fedora (think F9 GA and F9 Updates).
The idea was that, by convention, we would have external-repo-only tags, with only a single external repo associated with it and no packages/builds associated. These external-repo-only tags could then be inserted into the build hierarchy where appropriate. An ordered list of external repos could then be constructed by performing the current depth-first search of the inheritance hierarchy. The ordered list would then be passed to mergerepo, which would ensure that packages in repos earlier in the list supersede packages (by srpm name) in repos later in the list. This would preserve the "first-match-wins" inheritance policy that Koji currently implements, and that admins expect. For example:
    dist-custom-build
     ├─dist-custom
     └─dist-f9-updates-external
        └─dist-f9-ga-external
would result in mergerepo creating a single repo that would only contain packages from dist-f9-ga-external if they did not exist in the Koji-generated repo (dist-custom-build + dist-custom), dist-f9-updates-external, or the blacklist of blocked packages. This is consistent with how Koji package inheritance currently works, and I think is the most intuitive approach.
It is similar, but different in potentially confusing ways. External repos do not have build structure, so we can't really have the same sort of inheritance behavior with a combination of external repo tags and normal tags.
We order the external repos in inheritance order, but ultimately those repos are merged with the internal one in a way that does not honor inheritance in the way that the admin might expect.
Using tags to represent external repos fails intuition because external repos are very much not like tags. When we get to supporting external koji systems, we can do something like this, but for external repos the "bolted-on" nature needs to be clear. This is why I'd prefer to have the data a little more removed.
I see all that, and I'm almost convinced. The flipside is that by default all the code will treat these external rpms the same as the local ones, which will not be correct for a number of cases.
Personally I'd prefer adding a few special cases to the existing code, rather than maintain a whole heap of almost-but-not-quite-the-same code to manage external rpms. I think that conceptually they're alike enough that the number of special cases will be minimal.
I think I'm ok with using the rpminfo table.
I think that synthesizing builds for the sake of maintaining the not-null constraint is more pain than it's worth, and would make enforcing our nvr-uniqueness constraints (which we definitely want to do for local builds) more difficult. Having locally-built rpms always associated with a build, and external rpms not, makes sense to me.
Ok, agreed.
Also, I'm thinking we need to have some sort of rpm_origin table so that all these references can be managed cleanly.
That sounds reasonable to me. Note that we may end up with a lot of rows in this table, since we're allowing variable substitution in the external_repo_url (tag name and arch). But I don't see that as a problem.
I'm thinking the only substitution we should support is arch. Anything else sort of constitutes a different repo.
If we use an origin table like this we can abstract out the arch. Something like:
create table external_repo (
    id SERIAL PRIMARY KEY,
    name TEXT
);

create table external_repo_config (
    external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
    url TEXT NOT NULL,
    -- plus versioning fields
    -- ...
);
This way, if an upstream repo changes its url scheme or moves to a different host, you can keep some notion of connectedness. External rpms would simply reference external_repo_id.
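The url stored in external_repo_config would then stay arch-agnostic and get expanded when the per-arch repos are generated, roughly like this (the $arch placeholder is just an assumed convention, not settled syntax):

def expand_url(url, arch):
    # one external_repo_config row covers every arch of the repo
    return url.replace('$arch', arch)

# http://example.com/epel/5/$arch/ -> .../5/i386/, .../5/x86_64/, ...
urls = [expand_url('http://example.com/epel/5/$arch/', arch)
        for arch in ('i386', 'x86_64', 'ppc')]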
In the same vein, what happens when an external repo has an nvra+sigmd5 matching a /local/ rpm? Maybe it doesn't matter, though I guess technically we want to record the origin properly when it gets into a buildroot via external repo vs internal tag.
Right, we would record the origin as the remote repo it came from (by parsing the merged repodata and looking at the baseurl).
So where do we draw the line between code that we add to koji and code that we add to createrepo (or some external merge-repo tool)?
However, we will already be parsing the remote repodata, which contains information like the srpm name for each rpm, so we could do something more sophisticated here.
-snipsnip- ...
The repomerge tool seems like it solves the problem better, and would be more useful in general.
If we're going to have our fingers in the repodata, we'll probably want to have them in the merge too. Perhaps we can get createrepo and/or this repomerge tool usefully libified?
I was thinking we would probably just call out to the tool the way we do for createrepo, but I'm certainly not against using an API. I'm a little concerned about memory usage when doing the create/mergerepo in-process, since we know python and mod_python have garbage-collection issues, but that may be a "cross the bridge when we come to it" problem. Seth, is it feasible to provide an API to mergerepo that we could use directly?
I don't think I even saw a reply from Seth on this. Where does the mergerepo code stand now?
On Mon, 2008-10-06 at 15:14 -0400, Mike McLean wrote:
would result in mergerepo creating a single repo that would only contain packages from dist-f9-ga-external if they did not exist in the Koji-generated repo (dist-custom-build + dist-custom), dist-f9-updates-external, or the blacklist of blocked packages. This is consistent with how Koji package inheritance currently works, and I think is the most intuitive approach.
It is similar, but different in potentially confusing ways. External repos do not have build structure, so we can't really have the same sort of inheritance behavior with a combination of external repo tags and normal tags.
I don't think I even saw a reply from Seth on this. Where does the mergerepo code stand now?
mergerepo has been checked into createrepo and should do what you want, now.
It requires HEAD of createrepo and, as soon as I make a new release, yum 3.2.19-6 or 3.2.20.
-sv
Picking up this thread again, sorry about the long delay. I'd like to come to consensus on the approach here, hammer out any remaining details at FUDCon this weekend, and hopefully get this implemented by the end of January. Time to really get rid of plague!
On Mon, 2008-10-06 at 15:14 -0400, Mike McLean wrote:
Mike Bonnet wrote:
On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
Mike Bonnet wrote:
On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
If the remote_repo_url data is going to be inherited (and I tend to think it should be), then I think it should be in a separate table.
...
I don't have any problem with this, though it does mean we'll need to duplicate quite a bit of the inheritance-walking code,
...
Walking inheritance is just a matter of determining the inheritance order and scanning data on the parent tags in sequence.
...
Sorry, I was referring to walking tag_inheritance. I'd rather have one place that walks the inheritance hierarchy and aggregates data from it, than two places that are doing almost the same thing.
We're talking about inherently different data. External repos to be merged in are quite different from builds in the system.
Yes, I see the issue here. Since remote repos won't have their packages filtered out (by mergerepo) until after all packages in the local inheritance hierarchy are placed in the repo, they don't really follow the existing inheritance rules.
Ok, you've convinced me. A separate table that stores a priority-ordered list of remote repos associated with each tag will probably be easier to manage. The lists will be aggregated when walking the tag hierarchy and passed to mergerepo in (priority, inheritance) order for proper filtering (based on srpm name, first match wins).
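Roughly, the ordering handed to mergerepo would be built like this (sketch only; dropping duplicate urls just keeps the list sane if the same repo shows up more than once in the hierarchy):

def mergerepo_order(tags_in_inheritance_order, external_repos_by_tag):
    ordered, seen = [], set()
    for tag in tags_in_inheritance_order:
        # within a tag, lower priority values sort first
        for repo in sorted(external_repos_by_tag.get(tag, []),
                           key=lambda r: r['priority']):
            if repo['url'] not in seen:
                seen.add(repo['url'])
                ordered.append(repo)
    # pass to mergerepo in this order; first match (by srpm name) wins
    return ordered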
Each tag has a set of builds associated with it. We walk the inheritance hierarchy, aggregating the builds from each tag in the hierarchy into a flat list, and then pass that list to createrepo. We would do essentially the same thing for external repos. When walking the hierarchy, if a tag has an external repo associated with it, we would append that repo url to a flat list, and pass that list to mergerepo. In both cases we're working with collections of packages that are associated with a tag, just in different formats.
Sure, we can do this with one call to readFullInheritance, and traverse both the build table and external repo table from the given order.
Yes, that makes sense.
In discussing this with Jesse, I think we want external repos to be inherited. This is probably the easiest way to deal with having multiple external repos getting pulled in to a single buildroot, which is essential for Fedora (think F9 GA and F9 Updates).
The idea was that, by convention, we would have external-repo-only tags, with only a single external repo associated with it and no packages/builds associated. These external-repo-only tags could then be inserted into the build hierarchy where appropriate. An ordered list of external repos could then be constructed by performing the current depth-first search of the inheritance hierarchy. The ordered list would then be passed to mergerepo, which would ensure that packages in repos earlier in the list supersede packages (by srpm name) in repos later in the list. This would preserve the "first-match-wins" inheritance policy that Koji currently implements, and that admins expect. For example:
    dist-custom-build
     ├─dist-custom
     └─dist-f9-updates-external
        └─dist-f9-ga-external
would result in mergerepo creating a single repo that would only contain packages from dist-f9-ga-external if they did not exist in the Koji-generated repo (dist-custom-build + dist-custom), dist-f9-updates-external, or the blacklist of blocked packages. This is consistent with how Koji package inheritance currently works, and I think is the most intuitive approach.
It is similar, but different in potentially confusing ways. External repos do not have build structure, so we can't really have the same sort of inheritance behavior with a combination of external repo tags and normal tags.
We order the external repos in inheritance order, but ultimately those repos are merged with the internal one in a way that does not honor inheritance in the way that the admin might expect.
Using tags to represent external repos fails intuition because external repos are very much not like tags. When we get to supporting external koji systems, we can do something like this, but for external repos the "bolted-on" nature needs to be clear. This is why I'd prefer to have the data a little more removed.
Ok, we're agreed on this.
I see all that, and I'm almost convinced. The flipside is that by default all the code will treat these external rpms the same as the local ones, which will not be correct for a number of cases.
Personally I'd prefer adding a few special cases to the existing code, rather than maintain a whole heap of almost-but-not-quite-the-same code to manage external rpms. I think that conceptually they're alike enough that the number of special cases will be minimal.
I think I'm ok with using the rpminfo table.
I think that synthesizing builds for the sake of maintaining the not-null constraint is more pain than it's worth, and would make enforcing our nvr-uniqueness constraints (which we definitely want to do for local builds) more difficult. Having locally-built rpms always associated with a build, and external rpms not, makes sense to me.
Ok, agreed.
Also, I'm thinking we need to have some sort of rpm_origin table so that all these references can be managed cleanly.
That sounds reasonable to me. Note that we may end up with a lot of rows in this table, since we're allowing variable substitution in the external_repo_url (tag name and arch). But I don't see that as a problem.
I'm thinking the only substitution we should support is arch. Anything else sort of constitutes a different repo.
If we use an origin table like this we can abstract out the arch. Something like:
create table external_repo (
    id SERIAL PRIMARY KEY,
    name TEXT
);

create table external_repo_config (
    external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
    url TEXT NOT NULL,
    -- plus versioning fields
    -- ...
);
This way, if an upstream repo changes its url scheme or moves to a different host, you can keep some notion of connectedness. External rpms would simply reference external_repo_id.
Makes sense. So a tag would simply reference the external_repo_id as well, and the repo url would be set elsewhere (globally). The table storing the external repo info for tags would look like:
create table tag_external_repos (
    tag_id INTEGER NOT NULL REFERENCES tag(id),
    external_repo_id INTEGER NOT NULL REFERENCES external_repo(id),
    priority INTEGER NOT NULL,
    -- plus versioning fields
    UNIQUE (tag_id, priority, active)
);
I like this, it keeps everything much more normalized.
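On the hub side, pulling the external repo list for a tag could then be as simple as the following (a sketch against the tables above; the 'active' column is assumed to follow the same versioning convention as tag_config):

def get_tag_external_repos(cursor, tag_id):
    cursor.execute("""
        SELECT er.id, er.name, erc.url, ter.priority
          FROM tag_external_repos ter
          JOIN external_repo er ON er.id = ter.external_repo_id
          JOIN external_repo_config erc ON erc.external_repo_id = er.id
         WHERE ter.tag_id = %(tag_id)s
           AND ter.active IS TRUE
           AND erc.active IS TRUE
         ORDER BY ter.priority""", {'tag_id': tag_id})
    return cursor.fetchall()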
In the same vein, what happens when an external repo has an nvra+sigmd5 matching a /local/ rpm? Maybe it doesn't matter, though I guess technically we want to record the origin properly when it gets into a buildroot via external repo vs internal tag.
Right, we would record the origin as the remote repo it came from (by parsing the merged repodata and looking at the baseurl).
Right, and the origin can just be stored as a reference to the external_repo(id).
So where do we draw the line between code that we add to koji and code that we add to createrepo (or some external merge-repo tool)?
Koji would only be responsible for parsing the repodata and populating the database with the correct origin for any given rpm. mergerepo would be responsible for creating the repo and enforcing the filtering rules.
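For the Koji side of that, I'm picturing something like the following to walk the merged primary.xml and work out each rpm's origin (this assumes mergerepo records the origin as an xml:base/baseurl on each package's location element, as discussed above; rpms without one came from the local repo):

import gzip
from xml.etree import ElementTree as ET

COMMON = '{http://linux.duke.edu/metadata/common}'
RPM = '{http://linux.duke.edu/metadata/rpm}'
XML = '{http://www.w3.org/XML/1998/namespace}'

def iter_origins(primary_xml_path):
    if primary_xml_path.endswith('.gz'):
        fo = gzip.open(primary_xml_path)
    else:
        fo = open(primary_xml_path)
    for event, pkg in ET.iterparse(fo):
        if pkg.tag != COMMON + 'package':
            continue
        ver = pkg.find(COMMON + 'version')
        nvra = (pkg.findtext(COMMON + 'name'), ver.get('ver'),
                ver.get('rel'), pkg.findtext(COMMON + 'arch'))
        sourcerpm = pkg.findtext(COMMON + 'format/' + RPM + 'sourcerpm')
        origin = pkg.find(COMMON + 'location').get(XML + 'base')
        yield nvra, sourcerpm, origin   # origin is None for local rpms
        pkg.clear()                     # keep memory usage down on big repos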
However, we will already be parsing the remote repodata, which contains information like the srpm name for each rpm, so we could do something more sophisticated here.
-snipsnip- ...
The repomerge tool seems like it solves the problem better, and would be more useful in general.
If we're going to have our fingers in the repodata, we'll probably want to have them in the merge too. Perhaps we can get createrepo and/or this repomerge tool usefully libified?
I was thinking we would probably just call out to the tool the way we do for createrepo, but I'm certainly not against using an API. I'm a little concerned about memory usage when doing the create/mergerepo in-process, since we know python and mod_python have garbage-collection issues, but that may be a "cross the bridge when we come to it" problem. Seth, is it feasible to provide an API to mergerepo that we could use directly?
I don't think I even saw a reply from Seth on this. Where does the mergerepo code stand now?
It has been written by Seth, I just need to test it. The tool currently has command-line flags to do everything we need it to do (I believe) but we could also use it as an example to use the api directly.