<tl;dr> I've come up with a method of splitting repodata into chunks that can be downloaded and combined with chunks that are already on the local system to create a byte-for-byte copy of the compressed repodata. Tools and scripts are at: https://www.jdieter.net/downloads/ </tl;dr>
Background:
With DNF, we're currently downloading ~20MB of repository data every time the updates repository changes.
When casync was released, I wondered if we could use it to only download the deltas for the repodata. At Flock last summer, I ran some tests against the uncompressed repodata and saw a reduction of 30-40% from one day to the next, which seemed low, but was a good starting point.
Unfortunately, due to the way casync separates each file into thousands of compressed chunks, building each file required thousands of (serial) downloads which, even on a decent internet connection, took *forever*.
When I talked through the idea with Kevin and Patrick, they also pointed out that our mirrors might not be too keen on the idea of adding thousands of tiny files that change every day.
The Solution(?):
One potential solution to the "multitude of files" problem is to merge the chunks back into a single file and use HTTP ranges to download only the parts of the file we want. An added bonus is that most web servers are configured to support hundreds of ranges in one request, which greatly reduces the number of requests we have to make.
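To make that concrete, here's a minimal sketch of what a multi-range request looks like from Python; the URL and byte offsets are made up for illustration:

import urllib.request

# Request three separate byte ranges in a single HTTP request. A server
# that honors this replies "206 Partial Content" with a
# multipart/byteranges body; a server that ignores Range headers just
# sends the whole file with a 200.
req = urllib.request.Request("https://example.com/repodata/primary.zck")
req.add_header("Range", "bytes=0-1023,5000-5999,12000-12499")

with urllib.request.urlopen(req) as resp:
    print(resp.status)                   # 206 if the ranges were honored
    print(resp.headers["Content-Type"])  # multipart/byteranges; boundary=...
    body = resp.read()                   # parts separated by the boundary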
The other problem with casync is that its chunk separation is naïve, which is why we were only achieving 30-40% savings. But we know what the XML file is supposed to look like, so we can separate the chunks on the tag boundaries in the XML.
So I've ditched casync altogether and put together a proof-of-concept (tentatively named zchunk) that takes an XML file, compresses each tag separately, and then concatenates all of them into one file. The tool also creates an index file that tells you the sha256sum for each compressed chunk and the location of the chunk in the file.
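This isn't the actual zchunk.py, but the core idea fits in a few lines; a sketch, with a deliberately naive regex split standing in for the script's XML handling:

import hashlib
import re
import zlib

def make_zchunk(xml_text, tag="package"):
    # Split the document on <package>...</package> boundaries (the real
    # script parses the XML properly instead of using a regex).
    chunks = re.findall(r"<%s\b.*?</%s>" % (tag, tag), xml_text, re.S)

    blob = b""
    index = []  # (sha256 of compressed chunk, end offset in the blob)
    for chunk in chunks:
        comp = zlib.compress(chunk.encode("utf-8"), 9)
        blob += comp
        index.append((hashlib.sha256(comp).hexdigest(), len(blob)))
    return blob, index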
I've also written a small script that will download a zchunk off the internet. If you don't specify an old file, it will just download everything, but if you specify an old file, it will download the index of the new file and compare the sha256sums of each chunk. Any checksums that match will be taken from the old file, and the rest will be downloaded.
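The reuse logic is roughly the following sketch; index entries are (checksum, end-offset) pairs as in the sketch above, and the actual HTTP fetching is left out:

def plan_download(new_index, old_index, old_blob):
    # Map checksum -> chunk bytes for every chunk we already have locally.
    old_chunks, start = {}, 0
    for csum, end in old_index:
        old_chunks[csum] = old_blob[start:end]
        start = end

    reused, ranges, start = {}, [], 0
    for csum, end in new_index:
        if csum in old_chunks:
            reused[start] = old_chunks[csum]  # chunk copied from the old file
        else:
            ranges.append((start, end - 1))   # inclusive byte range to fetch
        start = end

    # All the missing chunks become a single multi-range request.
    range_header = "bytes=" + ",".join("%d-%d" % r for r in ranges)
    return reused, range_header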
In testing, I've seen savings ranging from 10% (December 17 to today) to 95% (yesterday to today).
Remaining problems:
* Zchunk files are bigger than their gzip equivalents. This ranges from 5% larger for filelists.xml to 300% larger for primary.xml. This can be greatly reduced by chunking primary.xml based on srpm rather than rpm, which brings the size increase for primary.xml down to roughly 30%.
* Many changes to the metadata can mean a large number of ranges requested. I ran a check on our mirrors, and three (out of around 150 that had the file I was testing) don't honor range requests at all, and three others only honor a small number in a single request. A further seven didn't respond at all (not sure if that had anything to do with the range requests), and the rest supported between 256 and 512 ranges in a single request. We can reduce the number of ranges requested by always ordering our packages by date. This would ensure that new packages are grouped at the end of the XML, where they will be grabbed in one contiguous range.
* Zchunk files use zlib (it gives better compression than xz with such small chunks), but, because they use a custom zdict, they are not gz files. This means that we'll need new tools to read and write them. (And I am volunteering to do the work here)
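For anyone curious what a custom zdict looks like in practice, Python's zlib module has supported preset dictionaries for a while; a minimal sketch (the dictionary contents here are invented):

import zlib

# Strings that appear in every chunk; for repodata this would mostly be
# the XML tag names and attribute boilerplate.
zdict = b'<package type="rpm"><name></name><arch>x86_64</arch>'

comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                        zlib.Z_DEFAULT_STRATEGY, zdict)
data = comp.compress(b'<package type="rpm"><name>foo</name>...') + comp.flush()

# The same dictionary has to be supplied on the way back out, which is
# exactly why these streams can't be read by plain gzip tools.
decomp = zlib.decompressobj(zlib.MAX_WBITS, zdict)
original = decomp.decompress(data) + decomp.flush()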
The tools:
The proof-of-concept tools are all sitting in https://www.jdieter.net/downloads/zchunk-scripts/
They're full of ugly hacks (especially when it comes to parsing the XML), there's little to no error reporting, and I didn't comment them well at all, but they should work.
If all you want to do is download zchunks, you just need to run dl_zchunk.py with the URL you want to download (ending in .zck) as the first parameter. Repodata for various days over the last few weeks is at: https://www.jdieter.net/downloads/zchunk-test/ (you may need to hover over the links to see which is which). The downloads directory is also available over rsync at rsync://jdieter.net/downloads/zchunk-test.
dl_zchunk.py doesn't show anything if you download the full file, but if you run the command with an old file as the second parameter, it will show four numbers: bytes taken from the old file, bytes downloaded from the new, total downloaded bytes and total uploaded bytes.
zchunk.py creates a .zck file. To group chunks by source rpm in primary.xml, run ./zchunk.py <file> rpm:sourcerpm
unzchunk.py decompresses a .zck file to stdout
I realize that there's a lot to digest here, and it's late, so I'm sure I've missed something. Please let me know if you have any suggestions, criticisms or flames, though it might be a few hours before I respond.
Jonathan
On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
...snip...
I've realized that with this, I didn't really give a proposal as to what comes next.
Proposal:
* Create a new (Systemwide?) Feature for Fedora 29 called Delta Repodata?
* Finalize the zchunk file format (and name)
* Write a C and Python library to generate, read and download zchunk files, and package it into Fedora
* Add a flag to createrepo_c (and, if we're still using it, createrepo) to generate zchunk repodata
* Modify DNF so it can download and read zchunk repodata
I don't mind being the person to drive this change, but before I start coding, I'd like some feedback on whether or not this is the direction we want to go in and I'd love any suggestions on how to improve it.
Jonathan
CCing the rpm-ecosystem@ ML since it's the main location where this message should have gone 😉
On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
...snip...
We can reduce the number of ranges requested by always ordering our packages by date. This would ensure that new packages are grouped at the end of the XML, where they will be grabbed in one contiguous range.
This would "break" DNF, because libsolv is assigning Id's by the order of packages in metadata. So if something requires "webserver" and there is "nginx" and "httpd" providing it (without versions), then lowest Id is picked up (not going into details of this). Which means depending on when last update for one or other was submitted, users will get different results. This is unacceptable from my POV.
- Zchunk files use zlib (it gives better compression than xz with such small chunks), but, because they use a custom zdict, they are not gz files. This means that we'll need new tools to read and write them. (And I am volunteering to do the work here)
What about zstd? Also, the latest version of lz4 has support for dictionaries too.
...snip...
As someone who has tried to work on this problem, I really appreciate what you have done here. We started with zsync and the results were quite good, but zsync is dead and has a ton of bugs. It also requires archives to be --rsyncable. So my question is: why not add the idx file as an additional file alongside the existing ones, instead of inventing a new format? The problem is that we would have to distribute the old format too (for compatibility reasons).
I'm not sure that doing optimizations by XML tags is a very good idea, especially because I hope that in the future we will stop distributing XMLs and start distributing solv/solvx.
--
Igor Gnatenko
On Tue, 2018-02-13 at 10:52 +0100, Igor Gnatenko wrote:
On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
...snip... We can reduce the number of ranges requested by always ordering our packages by date. This would ensure that new packages are grouped at the end of the XML, where they will be grabbed in one contiguous range.
This would "break" DNF, because libsolv is assigning Id's by the order of packages in metadata. So if something requires "webserver" and there is "nginx" and "httpd" providing it (without versions), then lowest Id is picked up (not going into details of this). Which means depending on when last update for one or other was submitted, users will get different results. This is unacceptable from my POV.
That's fair enough, though how hard would it be to change libsolv to assign Ids based on alphabetical order as opposed to metadata order (or possibly to reorder the XML before sending it to libsolv)?
To be clear, this optimization would reduce the number of range requests we have to send to the server, but would not hugely change the amount we download, so I don't think it's very high priority.
- Zchunk files use zlib (it gives better compression than xz with such small chunks), but, because they use a custom zdict, they are not gz files. This means that we'll need new tools to read and write them. (And I am volunteering to do the work here)
What about zstd? Also, the latest version of lz4 has support for dictionaries too.
I'll take a look at both of those.
As someone who has tried to work on this problem, I really appreciate what you have done here. We started with zsync and the results were quite good, but zsync is dead and has a ton of bugs. It also requires archives to be --rsyncable. So my question is: why not add the idx file as an additional file alongside the existing ones, instead of inventing a new format? The problem is that we would have to distribute the old format too (for compatibility reasons).
I'm not sure if it was clear, but I'm basically making --rsyncable archives with more intelligent divisions between the independent blocks, which is why it gives better delta performance... you're not getting *any* redundant data.
I did originally experiment with xz files (a series of concatenated xz files is still a valid xz file), but the files were 20% larger than zlib with custom zdict.
The zdict helps us reduce file size by allowing all the chunks to use the same common strings that will not change (mainly tag names), but custom zdicts aren't allowed by gzip.
I've also toyed with the idea of supporting embedded idx's in zchunk files so we don't have to keep two files for every local zchunk file. We'd still want separate idx files on the webserver, though; otherwise we're looking at an extra HTTP request to get the size of the index in the zchunk. If we embed the index in the file, we must create a new format, as we don't want the index concatenated with the rest of the uncompressed file when decompressing.
I'm not sure that doing optimizations by XML tags is a very good idea, especially because I hope that in the future we will stop distributing XMLs and start distributing solv/solvx.
zchunk.py shouldn't care what type of data it's chunking, but it needs to be able to chunk the same way every time. Currently it only knows how to do that with XML, because we can split it based on tag boundaries, and grouping based on source rpm gives us even better compression without sacrificing any flexibility.
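As a sketch of what grouping by source rpm involves (the namespace URIs are the usual ones for primary.xml, but treat this as illustrative rather than what zchunk.py actually does):

import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespaces used by primary.xml
NS = {"c": "http://linux.duke.edu/metadata/common",
      "rpm": "http://linux.duke.edu/metadata/rpm"}

root = ET.parse("primary.xml").getroot()
groups = defaultdict(list)
for pkg in root.findall("c:package", NS):
    srpm = pkg.findtext("c:format/rpm:sourcerpm", default="", namespaces=NS)
    groups[srpm].append(pkg)

# Each group becomes one chunk, so all subpackages built from the same
# source rpm are compressed (and change) together.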
dl_zchunk.py and unzchunk.py neither know nor care what type of file they're working with.
Thanks so much for the feedback, and especially for the pointers to lz4 and zstd. Hopefully they'll get us closer to matching our current gz size.
Jonathan
On Tue, 2018-02-13 at 10:52 +0100, Igor Gnatenko wrote:
What about zstd? Also, the latest version of lz4 has support for dictionaries too.
So I've investigated zstd, and here are my results:
Latest F27 primary.gz - 3.1MB
zlib zchunk (including custom dict) primary.zck - 4.2MB ~35% increase
zstd zchunk (including dict generated from last three Fedora GA primaries) primary.zck - 3.7MB ~20% increase
Using zstd for filelists.xml has roughly the same increase as with zlib, which is expected as the chunks are larger and thus get better compression even without a dict.
I did also look briefly at lz4, but it seems that its major advantage is speed, and I'm not sure that metadata decompression speed is our main bottleneck in dnf.
With these numbers, I think it makes sense to move forward with zstd instead of zlib.
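For anyone who wants to reproduce the numbers, the python-zstandard bindings expose dictionary training; a rough sketch (split_into_chunks() is a hypothetical stand-in for the tag-boundary splitting described earlier, and the dictionary size is made up):

import zstandard as zstd

# Dictionary training wants many small samples, not one big file, so
# feed it the individual chunks.
samples = split_into_chunks(open("primary.xml", "rb").read())
dict_data = zstd.train_dictionary(100 * 1024, samples)

# Compress each chunk separately with the shared dictionary...
cctx = zstd.ZstdCompressor(level=19, dict_data=dict_data)
compressed = cctx.compress(samples[0])

# ...and decompress with the same dictionary.
dctx = zstd.ZstdDecompressor(dict_data=dict_data)
original = dctx.decompress(compressed)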
Jonathan
...snip...
I think it sounds interesting, but you should get buy-in from the dnf folks and/or the PackageKit folks and see if they can agree to use this format.
I also agree that just adding it as a new file while leaving the rest alone sounds good, as a way to migrate only those things that know to look for the new file when it exists, etc.
kevin
On Wed, 2018-02-14 at 09:56 -0800, Kevin Fenzi wrote:
...snip...
I think it sounds interesting, but you should get buy-in from the dnf folks and/or the PackageKit folks and see if they can agree to use this format.
Do you know if there's a dedicated list for dnf or PackageKit development (a quick Google search didn't turn up anything), or should I communicate with them directly? If the latter, can you point me to the right people?
I also agree that just adding it as a new file while leaving the rest alone sounds good, as a way to migrate only those things that know to look for the new file when it exists, etc.
+1
Jonathan
On Thu, 2018-02-15 at 08:35 +0200, Jonathan Dieter wrote:
On Wed, 2018-02-14 at 09:56 -0800, Kevin Fenzi wrote:
...snip...
Do you know if there's a dedicated list for dnf or PackageKit development (a quick Google search didn't turn up anything), or should I communicate with them directly? If the latter, can you point me to the right people?
They are supposed to listen to rpm-ecosystem@lists.rpm.org 😉 But I will CC all of them anyway.
--
Igor Gnatenko
So here's my proposed file format for the zchunk file. Should I add some flags to allow for different compression formats in the future?
+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
|  ID   |  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+
ID
  '\0ZCK', identifies the file as a zchunk file.

Index size
  A 64-bit unsigned integer containing the size of the compressed index.

Compressed Index
  The index, which is described in the next section. The index is compressed using standard zstd compression without a custom dictionary.

Compressed Dict
  A custom dictionary used when compressing each chunk. Because each chunk is compressed completely separately from the others, the custom dictionary gives us much better overall compression. The custom dictionary is itself compressed using standard zstd compression without a custom dictionary (for obvious reasons).

Chunk
  A chunk of data, compressed using zstd with the custom dictionary provided above.
The index:
+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
|          sha256sum           |  End of dict  |
+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+

+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
|          sha256sum           | End of chunk  | ==> More
+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
sha256sum of compressed dict
  A binary sha256sum of the compressed dict, used to detect whether two dicts are identical.

End of dict
  The location of the end of the dict, with 0 being the end of the index. This gives us the information we need to find and decompress the dict.

sha256sum of compressed chunk
  A binary sha256sum of the compressed chunk, used to detect whether any two chunks are identical.

End of chunk
  The location of the end of the chunk, with 0 being the end of the index. This gives us the information we need to find and decompress each chunk.
The index is designed to be able to be extracted from the file on the server and downloaded separately, to facilitate downloading only the parts of the file that are needed, but must then be re-embedded when assembling the file so the user only needs to keep one file.
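Putting it together, here's a reader sketch for the proposed layout; I'm assuming little-endian integers, 8-byte offsets in the index entries, and the python-zstandard bindings, none of which is nailed down yet:

import struct
import zstandard as zstd

def read_zck(path):
    with open(path, "rb") as f:
        data = f.read()

    assert data[:4] == b"\x00ZCK"                      # ID
    (index_size,) = struct.unpack_from("<Q", data, 4)  # assumed little-endian
    index = zstd.ZstdDecompressor().decompress(data[12:12 + index_size])

    # Assumed entry layout: 32-byte sha256sum + 8-byte end offset, with
    # offsets counted from the end of the index. Entry 0 is the dict.
    entries = [struct.unpack_from("<32sQ", index, off)
               for off in range(0, len(index), 40)]

    body = 12 + index_size
    dict_sum, dict_end = entries[0]
    zdict = zstd.ZstdCompressionDict(
        zstd.ZstdDecompressor().decompress(data[body:body + dict_end]))

    # Decompress each chunk with the shared custom dictionary.
    dctx = zstd.ZstdDecompressor(dict_data=zdict)
    out, start = [], dict_end
    for csum, end in entries[1:]:
        out.append(dctx.decompress(data[body + start:body + end]))
        start = end
    return b"".join(out)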