On Jan 30, 2005, Jeff Johnson n3npq@nc.rr.com wrote:
Alexandre Oliva wrote:
Err... By the time yum or any other depsolver decides to download a package, it's already got all the headers for all packages. And I
Yep, "already got" so lets's go get the header again.
I see. Still, it's probably unwise to have the transaction verification procedure wait for all the big packages to download, or compete with them for limited bandwidth, when the headers alone would have sufficed.
An idea to overcome this issue without throwing away possible web caching benefits would be to start a download of the entire rpm and, once you reach the end of the header, stop reading from that connection until you've completed the transaction verification.
If you have a web proxy, it will likely keep on downloading the entire package, and you'll end up downloading the rest of it very quickly, but the downloads will be competing for bandwidth.
If you don't have a web proxy, however, things may get messy: not only will you get competition for bandwidth, you'll also get competition for any limit on open connections that may be imposed on you by upstream (ISP, download server, etc). (My DSL provider, for example, won't let me establish more than 30 TCP connections simultaneously)
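As a rough sketch (not anything yum does today), and assuming the header's end offset is already known from the xml metadata, the "pause after the header" idea could look something like this in Python:

import http.client

def begin_package_download(host, path, header_end):
    """Start a plain GET for the whole package, but read only up to the
    end of the header; the caller can resume reading the payload later."""
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path)
    resp = conn.getresponse()
    header_bytes = resp.read(header_end)   # lead + signature + header
    # Hand header_bytes to the depsolver / transaction check here, and keep
    # the live response around so the payload can be drained afterwards
    # without opening a second connection.
    return header_bytes, resp

# Hypothetical usage:
#   hdr, resp = begin_package_download("mirror.example.org",
#                                      "/repo/foo-1.0-1.i386.rpm", 8192)
#   ... run the transaction check ...
#   payload = resp.read()   # finish the download once the check passes

Of course, this inherits all the caveats above about proxies and per-host connection limits.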
hope you're not suggesting that yum get rpm to download *all* packages just because it needs headers. *That* would be a waste of bandwidth.
Depends on how yum implements, but we agree that "all" is stupid, even if we appear to disagree whether headers being downloaded and then downloaded again is stupid.
We do agree on both counts.
into /var/cache/yum/repo/packages since you already know the header byte range you are interested in from the xml metadata, thereby saving the bandwidth used by reading the header twice.
Hmm... I hope you're not saying yum actually fetches the header portion out of the rpm files for purposes of dep resolution. Although I realize the information in the .xml file makes it perfectly possible, it also makes it (mostly?) redundant. Having to download not only the big xml files but also all of the headers would suck in a big way!
The rpmlib API requires a header for a ride. So yes, that is exactly what is happening: yum is using byte ranges to pull headers from discovered packages, where (if the discovered packages turn out to be needed) both header+payload could be pulled together and asynchronously.
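For illustration only (this is not yum's actual code; the URL and offsets are made up), a byte-range fetch of just the header region, with the range taken from the xml metadata, might look like:

import urllib.request

def fetch_header_range(url, start, end):
    """Fetch only bytes [start, end] of a remote .rpm via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        # A server honouring the range answers 206 Partial Content; a server
        # that ignores it answers 200 and sends the whole file anyway.
        if resp.status not in (200, 206):
            raise IOError("unexpected status %d" % resp.status)
        return resp.read()

# hdr_bytes = fetch_header_range("http://mirror.example.org/repo/foo-1.0-1.i386.rpm",
#                                0, 16383)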
I hope you're really not saying that, if I request to install package foo, that depends on bar, it will also download headers for baz, a totally unrelated package. I can see that we'd need headers for foo and bar, but not for baz. I thought the point of the xml files and the info on provides, filelists, etc, was precisely to enable the depsolver to avoid having to download the headers for every package.
I'm wondering if it would be possible for a depsolver to create a (smaller) .hdr file out of info in the .xml files, and feed that to rpmlib for transaction-verification purposes. This would enable it to skip the download-header step before downloading the entire package.
The repo data is a win over previous incarnations of yum because it's one, not hundreds, of files that need to be downloaded.
It's clear to me that it's a win for a one-shot download. What's not clear is that, downloading the entire .xml files 2-3 times a day, or every time the repo is updated (which rhn-applet would presumably do, although a simple listing of package N-V-Rs would be enough for it), you won't end up wasting more bandwidth than having the .hdr files downloaded once and for all.
filtering the data (i.e. headers have changelogs and more that is useless baggage) is also a win.
Definitely. But couldn't we perhaps do it by intelligently filtering information out of the rpm header and, say, generating a single archive containing all of the info needed for depsolving and for rpmlib's transaction verification?
It might still be useful to have something like header.info, but compressed, and listing not only N-V-Rs but also the header byte ranges, should one want to download the headers for individual packages instead of re-fetching the xml files (or equivalent) when a previous version of them was downloaded in the past, its data is still locally available, and it has undergone only very slight changes.
So the suggestion was to download the package, not the header, and then extract the header from local, not remote storage.
I see. Good one, if we can't help downloading the package (e.g., because rpmlib ends up deciding it can't install the packages)
I'd be very surprised if yum 2.1 actually worked this way. I expect far better from Seth, and from what I read during the design period of the metadata format, I understood that the point of the xml files was precisely to avoid having to download the hdr files in the first place. So why would they be needed? To get rpmlib to verify the transaction, perhaps?
What do you expect? There is no way to create a transaction using rpmlib without a header, a header is wired in the rpmlib API. So honk at failed rpmlib design, not otherwise.
I was expecting depsolving wouldn't require all the headers. And from what I gather from your reply, it indeed doesn't.
Look, repos currently change daily, perhaps twice daily.
They actually change more often than that. rawhide changes daily, yes. FC updates sometimes change several times in a single day, and then sometimes stay put for a few days. Could be once or twice a day, indeed. Other repos such as dag and freshrpms change more often than that, it seems to me. At least once a day would be an accurate description for them.
Trying to optimize incremental updates for something that changes perhaps twice a day is fluff.
Let me try some back-of-the-envelope calculations here. Consider an FC install that remains installed for 40 weeks (~ 9 months), and has a user permanently running rhn-applet, and whose administrator runs up2date once a day on average. Further consider that updates are released, on average, once a day, and that, on average, only two of the 7 weekly update runs actually have new packages to install (i.e., updates are generally published in batches)
Let's consider two scenarios: 1) using up2date with yum-2.0 (headers/) repos (whoever claimed up2date supported rpmmd repodata/ misled me :-) and 2) using yum-2.1 (repodata/) repos.
1) yum 2.0
16MiB) initial download, distro's and empty updates' hdrs
8MiB) daily (on average) downloads of header.info for updates, downloaded by rhn-applet, considering an average size of almost 30KiB, for 40 weeks. (both FC2 and FC3 updates for i386 have a header.info this big right now)
16MiB) .hdr files for updates, downloaded by the update installer. Current FC2 i386 headers/ holds 9832KiB, whereas FC3 i386 headers/ holds 8528KiB, but that doesn't count superseded updates, whose .hdr files are removed. The assumption is that each header is downloaded once. 16MiB is a guesstimate that I believe to be inflated. It doesn't take into account the duplicate downloads of header.info for updates, under the assumption that a web proxy would avoid downloading again what rhn-applet has already downloaded.
----
40MiB) just in metadata over a period of 9 months, total
2) yum 2.1
2.7MiB) initial download, distro's and empty updates' primary.xml.gz and filelists.xml.gz
68MiB) daily (on average) downloads of primary.xml.gz, downloaded by rhn-applet, considering an average size of 250KiB (FC2 updates' is 240KiB, whereas FC3's is 257KiB, plus about 1KiB for repomd.xml)
16MiB) .hdr files for updates, downloaded by the update installer (same as in case 1)
192MiB) filelists.xml.gz for updates, downloaded twice a week on average by the update installer, to solve filename deps.
----
278.7MiB) just in metadata over a period of 9 months, total
Looks like a waste of at least 238.7 MiB per user per 9-month install. Sure, it's not a lot, only 26.5MiB a month, but it's almost 6 times as much data being transferred for the very same purpose. How is that a win? Multiply that by the number of users pounding on your mirrors and it adds up to hundreds of GiB a month.
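(For the record, the totals above are just these sums, in MiB:)

yum20 = 16 + 8 + 16            # initial hdrs + header.info + update .hdr files
yum21 = 2.7 + 68 + 16 + 192    # initial xml + primary.xml.gz + .hdr files + filelists
print(yum20)                   # 40
print(yum21)                   # ~278.7
print(yum21 - yum20)           # ~238.7 MiB extra over the 9 months
print((yum21 - yum20) / 9)     # ~26.5 MiB per month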
Of course there are some factors that can help minimize the wastage, for example, a web proxy serving multiple machines, one of which is updated before the others, will be able to serve the headers for yum 2.1 out of the cached .rpm files, so you transfer the headers by themselves only once for all machines, instead of once per machine. But then, yum 2.0 enables the web proxy to cache headers anyway, so this would be a win for both, and less so for yum 2.1 if you update multiple boxes in parallel.
Another factor is that you probably won't need filelists.xml.gz for every update. Maybe I don't quite understand how often it is needed, but even if I have to download it only once a month, that's still 64MiB over 9 months, more than the 40MiB total metadata downloaded over 9 months by yum 2.0.
The rpm-metadata is already a huge win, as the previous incarnation checked time stamps on hundreds and thousands of headers, not one primary file.
I don't know how yum 2.0 did it, but up2date surely won't even try to download a .hdr file if it already has it in /var/spool/up2date, so this is not an issue.
Sure there are further improvements, but busting up repo metadata ain't gonna be where the win is, there's little gold left in that mine.
repodata helps the initial download, granted, but it loses terribly in the long run.
I hope you're really not saying that, if I request to install package foo, that depends on bar, it will also download headers for baz, a totally unrelated package. I can see that we'd need headers for foo and bar, but not for baz. I thought the point of the xml files and the info on provides, filelists, etc, was precisely to enable the depsolver to avoid having to download the headers for every package.
Just so we don't go off into deeply uninformed space:
yum 2.0.X downloaded all the headers in the headers directory that it did NOT have installed. It figured this out by reading header.info. This file stored nevra + rpm location. So yum 2.0.X downloaded this file to see what new headers it needed, downloaded them, then got on with the process at hand.
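(A rough sketch of that flow; the header.info line format assumed below, "epoch:name-version-release.arch=relative/path.rpm", is from memory and may not be exact:)

def headers_to_fetch(header_info_text, installed_nevras, cached_hdrs):
    """Pick the .hdr files a yum-2.0.X-style client would still need to download."""
    wanted = []
    for line in header_info_text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        nevra, rpm_location = line.split("=", 1)
        if nevra in installed_nevras or nevra in cached_hdrs:
            continue                   # already installed, or .hdr already cached
        wanted.append((nevra, rpm_location))
    return wanted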
I'm wondering if it would be possible for a depsolver to create a (smaller) .hdr file out of info in the .xml files, and feed that to rpmlib for transaction-verification purposes. This would enable it to skip the download-header step before downloading the entire package.
Talk to Paul Nasrat - he was working on that a while ago but I think he got stuck in some rabbit hole debugging something.
Definitely. But couldn't we perhaps do it by intelligently filtering information out of the rpm header and, say, generating a single archive containing all of the info needed for depsolving and for rpmlib's transaction verification?
you can't do that b/c file conflicts CAN NOT be calculated via rpm w/o having the full header and/or all the file information present.
I was expecting depsolving wouldn't require all the headers. And from what I gather from your reply, it indeed doesn't.
it requires all the headers of the packages involved, yes.
Let's consider two scenarios: 1) using up2date with yum-2.0 (headers/) repos (whoever claimed up2date supported rpmmd repodata/ misled me :-) and 2) using yum-2.1 (repodata/) repos.
- yum 2.0
16MiB) initial download, distro's and empty updates' hdrs
8MiB) daily (on average) downloads of header.info for updates, downloaded by rhn-applet, considering an average size of almost 30KiB, for 40 weeks. (both FC2 and FC3 updates for i386 have a header.info this big right now)
16MiB) .hdr files for updates, downloaded by the update installer. Current FC2 i386 headers/ holds 9832KiB, whereas FC3 i386 headers/ holds 8528KiB, but that doesn't count superseded updates, whose .hdr files are removed. The assumption is that each header is downloaded once. 16MiB is a guesstimate that I believe to be inflated. It doesn't take into account the duplicate downloads of header.info for updates, under the assumption that a web proxy would avoid downloading again what rhn-applet has already downloaded.
40MiB) just in metadata over a period of 9 months, total
yum 2.1
2.7MiB) initial download, distro's and empty updates' primary.xml.gz and filelists.xml.gz
68MiB) daily (on average) downloads of primary.xml.gz, downloaded by rhn-applet, considering an average size of 250KiB (FC2 updates' is 240KiB, whereas FC3's is 257KiB, plus about 1KiB for repomd.xml)
16MiB) .hdr files for updates, downloaded by the update installer (same as in case 1)
192MiB) filelists.xml.gz for updates, downloaded twice a week on average by the update installer, to solve filename deps.
278.7MiB) just in metadata over a period of 9 months, total
Looks like a waste of at least 238.7 MiB per user per 9-month install. Sure, it's not a lot, only 26.5MiB a month, but it's almost 6 times as much data being transferred for the very same purpose. How is that a win? Multiply that by the number of users pounding on your mirrors and it adds up to hundreds of GiB a month.
Another factor is that you probably won't need filelists.xml.gz for every update. Maybe I don't quite understand how often it is needed, but even if I have to download it only once a month, that's still 64MiB over 9 months, more than the 40MiB total metadata downloaded over 9 months by yum 2.0.
yum 2.1.x ONLY DOWNLOADS THE XML FILES WHEN IT NEEDS THEM.
go read the code and stop guessing.
it downloads repomd.xml every time - that's < 1K. it downloads primary.xml.gz if the file has changed - that's typically < 1M.
it downloads filelists.xml.gz only when there is a file dep that it cannot resolve with primary.xml.gz.
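(Paraphrasing that behaviour as a self-contained sketch; the function and field names are illustrative, not yum's real API:)

def metadata_to_download(old_repomd, new_repomd, unresolved_file_deps):
    """Decide which repodata files to fetch on this run.

    old_repomd / new_repomd: dicts mapping data type to checksum,
    e.g. {"primary": "abc...", "filelists": "def..."}.
    unresolved_file_deps: file deps that primary.xml.gz could not answer.
    """
    fetch = ["repomd.xml"]                      # always fetched, < 1K
    if new_repomd.get("primary") != old_repomd.get("primary"):
        fetch.append("primary.xml.gz")          # only when it has changed
    if unresolved_file_deps:                    # e.g. deps outside /etc/* and *bin/*
        fetch.append("filelists.xml.gz")        # only as a fallback
    return fetch

# metadata_to_download({"primary": "aaa"}, {"primary": "bbb"}, [])
#   -> ['repomd.xml', 'primary.xml.gz']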
I don't know how yum 2.0 did it, but up2date surely won't even try to download a .hdr file if it already has it in /var/spool/up2date, so this is not an issue.
yum 2.0.x certainly DID NOT download a .hdr file it already had. Sheesh, go read the code, stop making suppositions based on anecdotes.
repodata helps the initial download, granted, but it loses terribly in the long run.
only as the number of file deps outside of /etc/* and *bin/* increases.
if you keep the file deps in those paths then repodata is a huge win.
-sv
On Jan 30, 2005, seth vidal skvidal@phy.duke.edu wrote:
Definitely. But couldn't we perhaps do it by intelligently filtering information out of the rpm header and, say, generating a single archive containing all of the info needed for depsolving and for rpmlib's transaction verification?
you can't do that b/c file conflicts CAN NOT be calculated via rpm w/o having the full header and/or all the file information present.
You surely don't need the package description and the changelog for any of that, this was my point.
I was expecting depsolving wouldn't require all the headers. And from what I gather from your reply, it indeed doesn't.
it requires all the headers of the packages involved, yes.
For solving dependencies (as opposed to testing the transaction)?!?
yum 2.1.x ONLY DOWNLOADS THE XML FILES WHEN IT NEEDS THEM.
go read the code and stop guessing.
Go read my e-mail. This is all covered. I'm not guessing.
it downloads repomd.xml every time - that's < 1K.
Check.
it downloads primary.xml.gz if the file has changed - that's typically < 1M.
Check.
it downloads filelists.xml.gz only when there is a file dep that it cannot resolve with primary.xml.gz.
Check.
All covered in my e-mail. *You* stop guessing.
I don't know how yum 2.0 did it, but up2date surely won't even try to download a .hdr file if it already has it in /var/spool/up2date, so this is not an issue.
yum 2.0.x certainly DID NOT download a .hdr file it already had. Sheesh, go read the code, stop making suppositions based on anecdotes.
I'm not making suppositions. Granted, I didn't read the code, only observed behavior. My analysis is still valid.
repodata helps the initial download, granted, but it loses terribly in the long run.
only as the number of file deps outside of /etc/* and *bin/* increases.
So you're saying the factor I put in to account for that is too small? How much should it be to match reality? Is repodata still a win?
if you keep the file deps in those paths then repodata is a huge win.
I find that very hard to believe, since the downloads of primary.xml.gz alone are enough to get above what yum 2.0 would download. Go read my text!
You know, text is supposed to be easier to read than code, that's why we write comments. So instead of arguing that I haven't read your code, how about you pay just a little bit of attention to text that actually matches *exactly* what's in both versions of your code? If you find some passage particularly difficult to understand, I may try to explain it in other words (I'm not a native English speaker, you know), but refraining from reading it just because you *think* it doesn't match what your code does (even though it does match it) is making a fool of yourself.
You know, text is supposed to be easier to read than code, that's why we write comments. So instead of arguing that I haven't read your code, how about you pay just a little bit of attention to text that actually matches *exactly* what's in both versions of your code? If you find some passage particularly difficult to understand, I may try to explain it in other words (I'm not a native English speaker, you know), but refraining from reading it just because you *think* it doesn't match what your code does (even though it does match it) is making a fool of yourself.
Since we've reduced this to name calling I think I'll step away from the conversation.
-sv
On Jan 30, 2005, seth vidal skvidal@phy.duke.edu wrote:
Since we've reduced this to name calling I think I'll step away from the conversation.
Please don't do that before going through the numbers and analysis I posted and telling us what you think is wrong in them. I certainly don't intend this to turn into name calling. You were the one who wrote: "Just so we don't go off into deeply uninformed space:" and "Sheesh, go read the code, stop making suppositions based on anecdotes.", and then:
it downloads repomd.xml every time - that's < 1K. it downloads primary.xml.gz if the file has changed - that's typically < 1M.
it downloads filelists.xml.gz only when there is a file dep that it cannot resolve with primary.xml.gz.
Which is precisely the model I had in mind when I posted my e-mail. So maybe I had read the code, or didn't have to.
But instead of showing whatever you found to be in error in the hard data I posted, you chose to attack points where either I wasn't sure how exactly up2date, yum-2.0 and yum-2.1 chose to address a certain issue (and where the exact difference didn't matter at all for purposes of the comparison), or where I probably wasn't sufficiently clear in explaining what I had in mind, because you understood something completely different or missed the analysis, located elsewhere, that was needed to understand that passage.
So let's please go back to the numbers and stop the name calling. I apologize if I gave the impression that I wanted to get into this sort of discussion. I don't, and the only way I can justify my actions is that I felt attacked when you implied I had no clue what I was talking about. I'll be the first to admit I don't have complete knowledge, and I'm always eager to learn, but I don't see anything wrong with the analysis I posted, and your posting didn't show anything wrong with it either, which to me shows I wasn't that clueless after all.
On 30 Jan 2005 18:16:21 -0200, Alexandre Oliva aoliva@redhat.com wrote:
278.7MiB) just in metadata over a period of 9 months, total
That's about 1 megabyte per day. I'm hard pressed to say that it makes much difference in overall end-user traffic, especially seeing as you are probably an exception: it's much less than that for a "generic user" who has base, updates-released, and maybe freshrpms or fedora.us+livna configured.
Hell, I get about that much SPAM every day -- ~200 5k messages, that works out to about 1MiB.
Now, I had to download and install 180MiB of OpenOffice updates yesterday. THAT sucked. The amount of yum traffic compared to that is simply inconsequential.
Network bandwidth is getting cheaper by the day. I'm not sure it's worth developing ulcers and hernias over each kilobyte. Essentially, it's the same argument as whether it's better to write code in C or in assembler -- at some point the benefit of having an abstracted environment that's easy to maintain wins over the "but it's so much larger in size!".
Regards,
On Jan 31, 2005, Konstantin Ryabitsev mricon@gmail.com wrote:
On 30 Jan 2005 18:16:21 -0200, Alexandre Oliva aoliva@redhat.com wrote:
278.7MiB) just in metadata over a period of 9 months, total
That's about 1 megabyte per day.
Yeah. Multiply that by a few thousand users, if you happen to run one of the mirrors...
I'm hard pressed to say that it makes much difference in overall end-user traffic, especially seeing as you are probably an exception: it's much less than that for a "generic user" who has base, updates-released, and maybe freshrpms or fedora.us+livna configured.
How can it be less if you're downloading stuff from more repos?
Hell, I get about that much SPAM every day -- ~200 5k messages, that works out to about 1MiB.
I'm sure I get more than that. But that's not the point.
The point is the new repodata format was proposed to improve on what yum had, but I'm convinced it's a step backwards as it stands. It does offer one immediate advantage to the user, namely, the faster download of information needed for an initial dependency resolution, but, in the long run, you end up waiting longer for downloads in total.
Now, I had to download and install 180MiB of OpenOffice updates yesterday. THAT sucked. The amount of yum traffic compared to that is simply inconsequential.
Yeah, it's a pain. It just doesn't *feel* that bad when you spread this wait over 9 months, but it *is* that bad.
Network bandwidth is getting cheaper by the day.
Yeah, sure, so let's just waste it to make up? Doesn't sound very clever to me.
at some point the benefit of having an abstracted environment that's easy to maintain wins over the "but it's so much larger in size!".
What if it's smaller and easy to maintain?
On Mon, 2005-01-31 at 02:13 -0200, Alexandre Oliva wrote:
The point is the new repodata format was proposed to improve on what yum had, but I'm convinced it's a step backwards as it stands. It does offer one immediate advantage to the user, namely, the faster download of information needed for an initial dependency resolution, but, in the long run, you end up waiting longer for downloads in total.
Please, if we're going to change the repodata _again_, let's keep backwards compatibility this time. It's bad enough that yum in FC3 can't use repositories created on a standard FC2 system; we don't want that again in FC4.
[added rpm-metadata list]
On Jan 30, 2005, Jeff Johnson n3npq@nc.rr.com wrote:
Look, repos currently change daily, perhaps twice daily. Trying to optimize incremental updates for something that changes perhaps twice a day is fluff.
Generating xdeltas from the previous versions of the .xml.gz files to the current versions, and publishing the relative location of an alternate repomd.xml that describes them (modified to indicate they're deltas between the two given timestamps), doesn't sound like such a difficult or wasteful thing to do. We'd grow repomd.xml by a few bytes:
<data type="delta-chain"> <location href="repodata/$P-repodata.xml"/> <checksum type="sha">...</checksum> </data>
where $P stands for a prefix used to denote the previous generation of the repository. It could be a number, a timestamp, whatever, it doesn't matter. $P-repodata.xml could contain data such as:
<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="delta-other">
    <location href="repodata/$P-other.xdelta"/>
    <checksum type="sha">...</checksum>
    <timestamp>...</timestamp>
    <open-checksum type="sha">...</open-checksum>
  </data>
  <data type="delta-filelists">
    <location href="repodata/$P-filelists.xdelta"/>
    <checksum type="sha">...</checksum>
    <timestamp>...</timestamp>
    <open-checksum type="sha">...</open-checksum>
  </data>
  <data type="delta-primary">
    <location href="repodata/$P-primary.xdelta"/>
    <checksum type="sha">...</checksum>
    <timestamp>...</timestamp>
    <open-checksum type="sha">...</open-checksum>
  </data>
  <data type="delta-chain">
    <location href="repodata/$PP-repodata.xml"/>
    <checksum type="sha">...</checksum>
  </data>
</repomd>
The timestamps would be the same as those in the original repomd.xml file, and the checksums would be such that one could verify that (i) the delta file was downloaded correctly, and that (ii) the expanded original .xml.gz file, to which xdelta will apply the delta to obtain the newer version of the .xml file, matches what the delta expects (although IIRC xdelta already performs this check itself).
In this file, $PP would be the prefix for whatever previous version was available in the previous version of the repository, forming a linked list.
So anyone can walk the list until they find a delta whose timestamp matches that of the version they have, and then start downloading the xdeltas and applying them until they reach the current version.
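(A sketch of how a client might walk that chain; the element names follow the examples above, and fetch() stands in for whatever retrieves a relative location from the repository:)

import xml.etree.ElementTree as ET

NS = "{http://linux.duke.edu/metadata/repo}"

def walk_delta_chain(fetch, first_chain_href, my_timestamp):
    """Follow the delta-chain links, newest generation first, until we reach
    the generation whose timestamp matches the metadata we already hold.
    Returns the per-generation delta descriptions in the order they should
    be applied, or None if our generation has fallen off the chain."""
    steps = []
    href = first_chain_href          # the delta-chain href from repomd.xml
    while href:
        root = ET.fromstring(fetch(href))
        deltas, next_href = {}, None
        for data in root.findall(NS + "data"):
            dtype = data.get("type")
            loc = data.find(NS + "location").get("href")
            if dtype == "delta-chain":
                next_href = loc      # link to the previous generation ($PP, ...)
            else:                    # delta-primary, delta-filelists, delta-other
                ts = int(data.find(NS + "timestamp").text)
                deltas[dtype] = (loc, ts)
        steps.append(deltas)
        if any(ts == my_timestamp for _loc, ts in deltas.values()):
            return list(reversed(steps))   # found our generation: apply oldest-first
        href = next_href
    return None                            # no match: fall back to a full download

Each step then amounts to downloading the listed .xdelta files, applying them to the locally cached .xml files, and checking the open-checksums along the way.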
This extension would be fully backward-compatible, since you're free to not follow the delta-chain if you like. And, if you do, it might turn out that you don't find a timestamp that matches the files you have, which would be unfortunate, but then, you'll only have downloaded a bunch of small .xml files, no big deal.
On Monday 31 January 2005 11:08, Alexandre Oliva wrote:
Generating xdeltas from the previous versions of the .xml.gz files to the current versions, and publishing the relative location of an alternate repomd.xml that describes them (modified to indicate they're deltas between the two given timestamps), doesn't sound like such a difficult or wasteful thing to do.
This could be driven by an optional parameter to createrepo, which provides a list of packages to create a delta with. If it were fully automatic, it would only be a download win for the user. If it were maintainer-driven, it would be a win for both user and repo.
I would rather not utilize xdelta, because you're still regenerating the entire thing. Having xmlets that virtually add/subtract as a delta against primary.xml.gz would be optimal for both sides of the equation.
The xml formatting recommended is the way to go. Contents of delta are still up in the air.
Another advantage of the delta method, is that the on-disk pickled objects (or whatever back-end store is used) could be updated incrementally based on xml snippets coming in. Instead of regenerating the whole thing over again.
So, anyway, we can talk until our faces are blue. What we need is a candidate for this feature that can run against larger repositories. Rawhide and third parties could participate without major borkage with current yum.
Anyway, I'll poke at it and see what materializes.
On Jan 31, 2005, Jeff Pitman symbiont@berlios.de wrote:
This could be driven by an optional parameter to createrepo, which provides a list of packages to create a delta with.
Err... Why? We already have repodata/, and we're creating the new version in .repodata. We can use repodata/ however we like, I think.
If it were fully automatic, it would only be a download win for the user.
And the servers.
I would rather not utilize xdelta, because you're still regenerating the entire thing. Having xmlets that virtually add/subtract as a delta against primary.xml.gz would be optimal for both sides of the equation.
But then Seth rejects the idea because it makes for unmaintainable code. And I sort of agree with him now that I see a simpler way to accomplish the same bandwidth savings.
Another advantage of the delta method, is that the on-disk pickled objects (or whatever back-end store is used) could be updated incrementally based on xml snippets coming in. Instead of regenerating the whole thing over again.
This is certainly a good point, but it is also trickier to get right. And it might also turn out to be bigger: if you have to list what went away, you're probably emitting more information than xdelta's `skip these many bytes'. It's like comparing diff with xdelta: diff is reversible because it contains what was removed and what was added (plus optional context), whereas xdelta only contains what was inserted and what portions of the original remained.
Getting inserts small is trivial; getting removals small might be trickier, and to take advantage of pickling we need the latter.
Unless... Does anyone feel like implementing an xml-aware xdelta-like program in Python? :-)
On Monday 31 January 2005 12:27, Alexandre Oliva wrote:
On Jan 31, 2005, Jeff Pitman symbiont@berlios.de wrote:
This could be driven by an optional parameter to createrepo, which provides a list of packages to create a delta with.
Err... Why? We already have repodata/, and we're creating the new version in .repodata. We can use repodata/ however we like, I think.
Because, the way I'd implement it would not use a binary diff, such as xdelta. See, you're thinking at the level of createrepo crunching on the entire thing over again. I'm not. I'm thinking about it from a certain subset of packages driven by a parameter.
From a gcc/make analogy viewpoint, you can view this as updating one or two specs and running make on the whole source tree. Since you already have objects built, you rebuild the few you don't have, and relink at the very end. Here, the maintainer of the repo wins. Now, if we deferred the re-link to the user end, then the download for the user would win, too.
I would rather not utilize xdelta, because you're still regenerating the entire thing. Having xmlets that virtually add/subtract as a delta against primary.xml.gz would be optimal for both sides of the equation.
But then Seth rejects the idea because it makes for unmaintainable code. And I sort of agree with him now that I see a simpler way to accomplish the same bandwidth savings.
You got me. Not sure how the level of difficulty has changed at all. But, a couple of implementations wouldn't hurt. Shoot, it all might not save *anything*. Doing it first, then throwing it out is what we need now. If it works, great. If not, *shrug*, we live and learn.
Another advantage of the delta method, is that the on-disk pickled objects (or whatever back-end store is used) could be updated incrementally based on xml snippets coming in. Instead of regenerating the whole thing over again.
This is certainly a good point, but it is also trickier to get right. And it might also turn out to be bigger: if you have to list what went away, you're probably emitting more information than xdelta's `skip these many bytes'.
I would never go to this level of madness. My proposal is connected with generating Xml necessary for the job, not a low-level binary diff between two runs of createrepo. Before doing it like this, I'd explore the librsync option and just run createrepo once as usual and rsync transfer the diff across the line. Keeping track of two runs is, although a bit novel, a little too much.
thanks,