Hi-
I'm submitting a patch that provides an optimization to the isomd5sum
media check functionality. Presently, the algorithm calculates a
single md5sum after reading the entire disc, and reports a "PASS" or
"FAIL" at that time. When there is corrupted data on the disc, a user
will not know until the entire disc has been read--which can take
quite a while. It would be optimal for the algorithm to exit the
checking and report a failure as soon after reading the corrupted
section as possible.
I've accomplished this by adding two extra fields to the application
data header of the ISO image. The ISO 9660 spec allows for a
~512-character string, of which were 224 characters are used. I've
added two fields, "FRAGMENT SUMS" and "FRAGMENT COUNT", after which
326 characters are used.
The "FRAGMENT COUNT" is an integer representing the number of
fragments in which to break the entire ISO. The "FRAGMENT SUMS" is an
ascii string of concatenated leading characters of md5sums of
subsequent fragments.
I've #define'd the length of FRAGMENT SUMS to 60 characters, as this
number is evenly divisible by (2, 3, 4, 5, 6, 10, 12, 15, 20, 30).
I've also #define'd FRAGMENT COUNT to be 20, which means that ISO will
be broken into 20 fragments, each fragment represented by a
3-character string (the first 3 characters of the fragment's md5sum)
in FRAGMENT SUMS.
Looking at an example, the application data header used to look like:
ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS =
15;RHLISOSTATUS=1;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
It now looks like:
ISO MD5SUM = 1fca2222a947399d5b4a4226629be254;SKIPSECTORS =
15;RHLISOSTATUS=0;FRAGMENT SUMS =
67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4;FRAGMENT
COUNT= 20;THIS IS NOT THE SAME AS RUNNING MD5SUM ON THIS ISO!!
The string [67bd3263cccfda2d851a493b9a238e8877c45bca78e276e94437dd9f26e4]
is actually a concatenation of:
md5(frag(0))[0-2] = 67b
md5(frag(1))[0-2] = d32
md5(frag(2))[0-2] = 63c
md5(frag(3))[0-2] = ccf
md5(frag(4))[0-2] = da2
md5(frag(5))[0-2] = d85
md5(frag(6))[0-2] = 1a4
md5(frag(7))[0-2] = 93b
md5(frag(8))[0-2] = 9a2
md5(frag(9))[0-2] = 38e
md5(frag(10))[0-2] = 887
md5(frag(11))[0-2] = 7c4
md5(frag(12))[0-2] = 5bc
md5(frag(13))[0-2] = a78
md5(frag(14))[0-2] = e27
md5(frag(15))[0-2] = 6e9
md5(frag(16))[0-2] = 443
md5(frag(17))[0-2] = 7dd
md5(frag(18))[0-2] = 9f2
md5(frag(19))[0-2] = 6e4
Thus, if any one of these prescribed strings does not match what is
calculated, the checking exits immediately, as a corruption has been
detected. In the cases where the error is near the beginning of the
disc, the time to identify the problem is greatly reduced.
Note that there is a chance of collision. When the disc is broken
into 20 fragments, each fragment is represented by a 3-character
string, yielding a collision 1 in 16^3=4096 cases. Such a collision
only means that the user will not benefit from the optimization, as
the entire disc will be checked by the last operation just like it has
always been done and the error will be detected at that time. One can
reduce the chances of collision by decreasing the number of fragments,
or expanding the length of the FRAGMENT SUMS string.
This optimization occurs at virtually no performance cost. Media
checking is overwhelmingly I/O bound, reading from disc. The
additional calculations of smaller sums adds little additional
processing time, as most of that time was being spent waiting on new
data from disc.
Please consider for inclusion in Anaconda, and I'll be glad to answer
any questions or address concerns. I think the patch is pretty
simple, and isolated to two files.
Thanks!
Dustin Kirkland