Hi-
I'm submitting a patch that provides an optimization to the isomd5sum media check functionality. Presently, the algorithm calculates a single md5sum after reading the entire disc, and reports a "PASS" or "FAIL" at that time. When there is corrupted data on the disc, a user will not know until the entire disc has been read--which can take quite a while. It would be optimal for the algorithm to exit the checking and report a failure as soon after reading the corrupted section as possible.
I've accomplished this by adding two extra fields to the application data header of the ISO image. The ISO 9660 spec allows for a ~512-character string, of which were 224 characters are used. I've added two fields, "FRAGMENT SUMS" and "FRAGMENT COUNT", after which 326 characters are used.
The "FRAGMENT COUNT" is an integer representing the number of fragments in which to break the entire ISO. The "FRAGMENT SUMS" is an ascii string of concatenated leading characters of md5sums of subsequent fragments.
I've #define'd the length of FRAGMENT SUMS to 60 characters, as this number is evenly divisible by (2, 3, 4, 5, 6, 10, 12, 15, 20, 30). I've also #define'd FRAGMENT COUNT to be 20, which means that ISO will be broken into 20 fragments, each fragment represented by a 3-character string (the first 3 characters of the fragment's md5sum) in FRAGMENT SUMS.
Looking at an example, the application data header used to look like: ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = 15;RHLISOSTATUS=1;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
It now looks like: ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = 15;RHLISOSTATUS=0;FRAGMENT SUMS = 67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4;FRAGMENT COUNT= 20;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
The string [67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4] is actually a concatenation of: md5(frag(0))[0-2] = 67b md5(frag(1))[0-2] = d32 md5(frag(2))[0-2] = 63c md5(frag(3))[0-2] = ccf md5(frag(4))[0-2] = da2 md5(frag(5))[0-2] = d85 md5(frag(6))[0-2] = 1a4 md5(frag(7))[0-2] = 93b md5(frag(8))[0-2] = 9a2 md5(frag(9))[0-2] = 38e md5(frag(10))[0-2] = 887 md5(frag(11))[0-2] = 7c4 md5(frag(12))[0-2] = 5bc md5(frag(13))[0-2] = a78 md5(frag(14))[0-2] = e27 md5(frag(15))[0-2] = 6e9 md5(frag(16))[0-2] = 443 md5(frag(17))[0-2] = 7dd md5(frag(18))[0-2] = 9f2 md5(frag(19))[0-2] = 6e4
Thus, if any one of these prescribed strings does not match what is calculated, the checking exits immediately, as a corruption has been detected. In the cases where the error is near the beginning of the disc, the time to identify the problem is greatly reduced.
Note that there is a chance of collision. When the disc is broken into 20 fragments, each fragment is represented by a 3-character string, yielding a collision 1 in 16^3=4096 cases. Such a collision only means that the user will not benefit from the optimization, as the entire disc will be checked by the last operation just like it has always been done and the error will be detected at that time. One can reduce the chances of collision by decreasing the number of fragments, or expanding the length of the FRAGMENT SUMS string.
This optimization occurs at virtually no performance cost. Media checking is overwhelmingly I/O bound, reading from disc. The additional calculations of smaller sums adds little additional processing time, as most of that time was being spent waiting on new data from disc.
Please consider for inclusion in Anaconda, and I'll be glad to answer any questions or address concerns. I think the patch is pretty simple, and isolated to two files.
Thanks! Dustin Kirkland
On Fri, Mar 25, 2005 at 02:59:25PM -0500, Dustin kirkland wrote:
Hi-
I'm submitting a patch that provides an optimization to the isomd5sum media check functionality. Presently, the algorithm calculates a single md5sum after reading the entire disc, and reports a "PASS" or "FAIL" at that time. When there is corrupted data on the disc, a user will not know until the entire disc has been read--which can take quite a while. It would be optimal for the algorithm to exit the checking and report a failure as soon after reading the corrupted section as possible.
I've accomplished this by adding two extra fields to the application data header of the ISO image. The ISO 9660 spec allows for a ~512-character string, of which were 224 characters are used. I've added two fields, "FRAGMENT SUMS" and "FRAGMENT COUNT", after which 326 characters are used.
The "FRAGMENT COUNT" is an integer representing the number of fragments in which to break the entire ISO. The "FRAGMENT SUMS" is an ascii string of concatenated leading characters of md5sums of subsequent fragments.
I've #define'd the length of FRAGMENT SUMS to 60 characters, as this number is evenly divisible by (2, 3, 4, 5, 6, 10, 12, 15, 20, 30). I've also #define'd FRAGMENT COUNT to be 20, which means that ISO will be broken into 20 fragments, each fragment represented by a 3-character string (the first 3 characters of the fragment's md5sum) in FRAGMENT SUMS.
Looking at an example, the application data header used to look like: ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = 15;RHLISOSTATUS=1;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
It now looks like: ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS = 15;RHLISOSTATUS=0;FRAGMENT SUMS = 67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4;FRAGMENT COUNT= 20;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
The string [67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4] is actually a concatenation of: md5(frag(0))[0-2] = 67b md5(frag(1))[0-2] = d32 md5(frag(2))[0-2] = 63c md5(frag(3))[0-2] = ccf md5(frag(4))[0-2] = da2 md5(frag(5))[0-2] = d85 md5(frag(6))[0-2] = 1a4 md5(frag(7))[0-2] = 93b md5(frag(8))[0-2] = 9a2 md5(frag(9))[0-2] = 38e md5(frag(10))[0-2] = 887 md5(frag(11))[0-2] = 7c4 md5(frag(12))[0-2] = 5bc md5(frag(13))[0-2] = a78 md5(frag(14))[0-2] = e27 md5(frag(15))[0-2] = 6e9 md5(frag(16))[0-2] = 443 md5(frag(17))[0-2] = 7dd md5(frag(18))[0-2] = 9f2 md5(frag(19))[0-2] = 6e4
Thus, if any one of these prescribed strings does not match what is calculated, the checking exits immediately, as a corruption has been detected. In the cases where the error is near the beginning of the disc, the time to identify the problem is greatly reduced.
Note that there is a chance of collision. When the disc is broken into 20 fragments, each fragment is represented by a 3-character string, yielding a collision 1 in 16^3=4096 cases. Such a collision only means that the user will not benefit from the optimization, as the entire disc will be checked by the last operation just like it has always been done and the error will be detected at that time. One can reduce the chances of collision by decreasing the number of fragments, or expanding the length of the FRAGMENT SUMS string.
This optimization occurs at virtually no performance cost. Media checking is overwhelmingly I/O bound, reading from disc. The additional calculations of smaller sums adds little additional processing time, as most of that time was being spent waiting on new data from disc.
Please consider for inclusion in Anaconda, and I'll be glad to answer any questions or address concerns. I think the patch is pretty simple, and isolated to two files.
I like it! Just one question: why aren't you using the intermediate values of the full MD5?
Regards, Luciano Rocha
On Fri, 25 Mar 2005 22:57:51 +0000, Luciano Miguel Ferreira Rocha strange@nsk.no-ip.org wrote:
I like it! Just one question: why aren't you using the intermediate values of the full MD5?
Good point!
I suppose you mean dumping the MD5_Final(fragmd5sum, &fragmd5ctx) call and just looking at "md5ctx" as updated by MD5_Update(&md5ctx, buf, nread)?
Hadn't thought of that, mainly due to blissful ignorance of the actual functionality of MD5_*() functions as implemented by this library. Without turning the md5ctx into an md5sum via the MD5_Final() function, what do I do with the md5ctx value?
I'll check that out and update the patch, as I think that would save some time and effort crunching numbers...
Dustin
On Fri, 25 Mar 2005 22:57:51 +0000, Luciano Miguel Ferreira Rocha strange@nsk.no-ip.org wrote:
I like it! Just one question: why aren't you using the intermediate values of the full MD5?
Good point!
I suppose you mean dumping the MD5_Final(fragmd5sum, &fragmd5ctx) call and just looking at "md5ctx" as updated by MD5_Update(&md5ctx, buf, nread)?
Hadn't thought of that, mainly due to blissful ignorance of the actual functionality of MD5_*() functions as implemented by this library. Without turning the md5ctx into an md5sum via the MD5_Final() function, what do I do with the md5ctx value?
I'll check that out and update the patch, as I think that would save some time and effort crunching numbers...
Dustin
On 3/25/05, Luciano Miguel Ferreira Rocha strange@nsk.no-ip.org wrote:
I like it! Just one question: why aren't you using the intermediate values of the full MD5?
I've worked with this a little more, taking Luciano's suggestion to use the running context rather than maintaining a separate fragmd5ctx. I have it working (attached), but there's a side effect of doing it this way.
This version calls MD5_Final() on the md5ctx variable periodically (I assume that's what you meant--that's the only way I can see to actually extract a sum string out of a context variable). Each call to this function will fill a string with an md5 sum, but it also modifies the md5ctx value in the process. This ultimately results in a different md5 sum of the whole disc when compared to the original method where MD5_Final() is only called at the very end of the computing.
Thus, a certain degree of backward compatibility is lost, and I do not like this patch as much as my original patch. That is, if you implant the data in an ISO using the patch attached, the main md5sum implanted is different, and thus if you checked this ISO with an old version of checkisomd5 it would fail the disk. On the other hand, the previous patch I attached is both backward and forward compatible, at the expense of an extra variable, a couple of lines of code, and a few additional calculations.
If anyone has a cleaner way to do this, please let me know and I'll rework this. But at this time, I recommend my original patch (not the one attached).
Dustin
On Fri, Apr 15, 2005 at 11:38:47AM -0400, Dustin kirkland wrote:
On 3/25/05, Luciano Miguel Ferreira Rocha strange@nsk.no-ip.org wrote:
I like it! Just one question: why aren't you using the intermediate values of the full MD5?
I've worked with this a little more, taking Luciano's suggestion to use the running context rather than maintaining a separate fragmd5ctx. I have it working (attached), but there's a side effect of doing it this way.
This version calls MD5_Final() on the md5ctx variable periodically (I assume that's what you meant--that's the only way I can see to actually extract a sum string out of a context variable). Each call to this function will fill a string with an md5 sum, but it also modifies the md5ctx value in the process. This ultimately results in a different md5 sum of the whole disc when compared to the original method where MD5_Final() is only called at the very end of the computing.
Yes, MD5_Final destroys the md5ctx. I imagined MD5 had some function to get the value without destroying the information. Or at the very least an EVP_MD_CTX_copy alike.
One option is to create a copy of the md5ctx and do the MD5_Final on that one.
But I don't have the time to try that myself, sorry..
Regards, Luciano Rocha
On Fri, 2005-04-15 at 17:03 +0100, Luciano Miguel Ferreira Rocha wrote:
One option is to create a copy of the md5ctx and do the MD5_Final on that one.
Just having another context structure and copying the data with memcpy should get another perfectly valid context to run MD5_Final on.
On 4/15/05, Peter Jones pjones@redhat.com wrote:
On Fri, 2005-04-15 at 17:03 +0100, Luciano Miguel Ferreira Rocha wrote:
One option is to create a copy of the md5ctx and do the MD5_Final on that one.
Just having another context structure and copying the data with memcpy should get another perfectly valid context to run MD5_Final on.
Ok, you're both very right... And this fully well addresses all of the concerns I mentioned in the previous email. This was quite simple, I'm sorry I missed. Basically, two instances of:
memcpy(&fragmd5ctx, &md5ctx, sizeof(MD5_CTX)); MD5_Final(fragmd5sum, &fragmd5ctx);
Thus, we can continuously overwrite fragmd5ctx and never actually do the MD5_Final() on md5ctx until the very end.
So here's the third revision...
Dustin
On Fri, 2005-03-25 at 14:59 -0500, Dustin kirkland wrote: [... awesome stuff skipped ...]
What'd be really nice is if it displayed something at each checkpoint when "checkisomd5 --verbose" is used.
On 4/26/05, pjones pjones@redhat.com wrote:
What'd be really nice is if it displayed something at each checkpoint when "checkisomd5 --verbose" is used.
... thisfragsum[j] = '\0'; /* printf("\nFragment [%i]: %s ?= %s\n", previous_fragment, computedsum, thisfragsum); */ previous_fragment = current_fragment; ...
Something like this commented out debug line--already in the patch I previously submitted ;) I'll dress this up a bit and submit another patch.
Dustin
Peter,
Here's what the output looks like now... The percent complete tally will continue to update itself, plus you'll see a running status of the last fragment tested.
<screenshot> [dustin@t23 isomd5sum]$ ./checkisomd5 --verbose test.iso test.iso: 1fca2222a947399d5b4a4226629be254 Fragment sums: 67b6bc7835fc4fa898e1af3d7e58d11f29a776167aa2741aacb45a1762f4 Fragment count: 20 Percent complete: 043.0% Fragment[09/20] -> OK </screenshot>
--- When there's a failure, it'll looks something like this:
<screenshot>[dustin@t23 isomd5sum]$ ./checkisomd5 --verbose bad.iso bad.iso: 1fca2222a947399d5b4a4226629be254 Fragment sums: 67b6bc7835fc4fa898e1af3d7e58d11f29a776167aa2741aacb45a1762f4 Fragment count: 20 Percent complete: 066.7% Fragment[14/20] -> OK Fragment 15 of 20 is BAD! The supported flag value is 0 The media check is complete, the result is: FAIL.
It is not recommended to use this media. </screenshot>
Jeremy-
A full patch of both files is attached.
Dustin
anaconda-devel@lists.stg.fedoraproject.org