 dev/null                                   |binary
 rpms/README                                |    9
 rpms/freedup/README                        |    9
 rpms/freedup/freedup-1.5-7.fc13.src.rpm    |binary
 rpms/freedups/ChangeLog                    |  190 ++++
 rpms/freedups/README                       |  273 +++++++++
 rpms/freedups/freedups-0.6.14-0.noarch.rpm |binary
 rpms/freedups/freedups-0.6.14-0.src.rpm    |binary
 rpms/freedups/freedups-0.6.14.spec         |   81 ++
 rpms/freedups/freedups-0.6.14.tar.gz       |binary
 rpms/freedups/freedups-v0.6.14.pl          |  800 +++++++++++++++++++++++++++++
 11 files changed, 1353 insertions(+), 9 deletions(-)
New commits:

commit a8893f3cdde58dddcce0ad0065339207119dd7b8
Author: Robert 'Bob' Jensen <bob@fedoraunity.org>
Date:   Sat Sep 18 15:21:09 2010 -0500
Adding freedups files
diff --git a/rpms/freedup/README b/rpms/freedup/README
new file mode 100644
index 0000000..b04d360
--- /dev/null
+++ b/rpms/freedup/README
@@ -0,0 +1,9 @@
+# Simple description:
+# I have uploaded the src.rpm here because it's still pending inclusion
+# into the repos. Hopefully it won't be needed too much longer.
+#
+# I've tested it against 12/13/14 and rawhide (currently F15).
+# Enjoy!
+#
+# Note: the debuginfo packages don't work currently, so if you install the
+# src.rpm, you're on your own.
diff --git a/rpms/freedup/freedup-1.5-7.fc13.src.rpm b/rpms/freedup/freedup-1.5-7.fc13.src.rpm
new file mode 100644
index 0000000..85bf6ac
Binary files /dev/null and b/rpms/freedup/freedup-1.5-7.fc13.src.rpm differ
diff --git a/rpms/freedups/ChangeLog b/rpms/freedups/ChangeLog
new file mode 100644
index 0000000..4b2726a
--- /dev/null
+++ b/rpms/freedups/ChangeLog
@@ -0,0 +1,190 @@
+
+V0.6.7 - Apr 26, 2003
+- If the user requested DatesEqual, add mtime to the equivalence class.
+
+
+V0.6.6 - Apr 26, 2003
+- InodeOfFile used to be used in Md5SumOf (could stand on its own),
+  was created and cleared in IndexFile, and heavily in LinkFiles.
+  Md5SumOf now creates an entry on demand, and the create and clear are
+  moved entirely to LinkFiles. In fact, we don't clear it at all now.
+
+
+V0.6.5 - Apr 26, 2003
+- Start using size+uid+gid+mode as an equivalence class instead of size
+  in the InodesOfSize array. All the same tests are performed as
+  before, only we have fewer inodes to which the current one is
+  compared.
+- Stop reconstructing the InodesOfSize{size} on every inode. Just
+  replace an entry in this array if it becomes necessary.
+- Quiet the "Tried to link identical Inodes" message.
+
+
+V0.6.4 - Apr 16, 2003
+- Brown paper bag time. I compared a new inode to each of the inodes of
+  its size, trying to find a link. The problem was, if it didn't match
+  any of them, I failed to add it to the list of inodes of that size so
+  it might match future inodes. For example, if I have two pairs of
+  identical files, the first pair gets linked, the second does not. The
+  regression test was updated to check that this works in the future.
+  My sincere thanks to Martin Sheppard and Milton Yates at csiro.au for
+  debugging, finding, and sending in a flawless fix for this bug.
+
+
+V0.6.3 - Mar 9, 2003
+- Reasonably big change; we're now processing files immediately as
+  they're read from disk rather than waiting until everything's been
+  read into memory. I'm hoping this will allow one to actually make
+  some headway even in the case where there's a huge number of files. I
+  also suspect it'll go faster, as we're down from order ~1.5x num_files
+  to 1x num_files. This loses a bit of disk cache locality in
+  processing the nodes, but the lack of seeks and the fact that we don't
+  have to wait until everything's read in to make some progress should
+  more than make up for that.
+- Because of the above, I no longer figure out how many nodes are
+  solitary versus multiple.
+- Minor fixes.
+
+
+V0.6.2 - Feb 26, 2003
+- Slight modification. All calculated sums get written to KnownMd5sums
+  and NewMd5sums. We use KnownMd5sums for all internal work; NewMd5sums
+  is only used for appending new sums to the cache at the end.
+
+
+V0.6.1 - Feb 22, 2003
+- Break md5sums into known and new md5sums. Known sums came from the
+  cache and therefore don't need to be written out. New sums were
+  calculated on this run and are appended to the cache at the end.
+- Minor typos and fixes.
+
+
+V0.6.0 - Dec 9, 2002
+- Cleanups of a stable 0.5.9. Removed a few variables. Old debugging
+  code removed. Move \n into Debug.
+- LinkInodes doesn't call LinkFiles if ActuallyLink=no any more (there's
+  a mini version embedded in LinkInodes now that does the Debug prints).
+
+
+V0.5.9 - Dec 9, 2002
+- Changed Inodespec storage format from a slash-delimited string to the
+  packed SLSSSLLL format. Runtime peak memory for a 219800 file run went
+  from 87.8M to 78.8M; a 10% memory savings. Informal numbers show it
+  about 15% faster as well.
+- Wow. Instead of loading InodeOfFile during the initial file scan, I
+  leave it blank until we've discarded solitary inodes, and then I load
+  it with _just_ the files and inodes of the currently-being-worked-on
+  size. This brings the peak memory use for that same 219800 file run
+  down to 41.8M. Woah.
+- For reference, given that v0.5.6 needed 1.08x the ram of v0.5.7,
+  v0.5.6 would have needed 94.8M. We've saved 56% of our peak ram
+  requirements.
+- On a P3-1500, I can process 219800 files (whose directory entries are
+  in disk cache and that are already linked) in 80 seconds: 2,747
+  files/second.
+- New regression tests, including a full link of two copies of a kernel
+  source tree and a diff afterwards.
+- The truly verbose debugs show garbage if they try to print md5sums or
+  inodespecs, sorry. I'm guessing I'm the only person that sees them
+  anyways.
+
+
+V0.5.8 - Nov 30, 2002
+- Print a reasonably accurate estimate of how much space would have been
+  saved on dry runs (-a turned off).
+- Slightly restructure LinkInodes to reduce code repetition.
+
+
+V0.5.7 - Nov 27, 2002
+- Don't use the IndexedFiles{File}=0 test to guarantee unique files
+  anymore; use defined(@InodeOfFile{File}), which we have already.
+  Saves 8% of memory usage. :-)
+
+
+V0.5.6 - Nov 24, 2002
+- Use a separate cache for every user - safer.
+- Ignore files we can't stat for some reason.
+- Added regression test, to be run on every new version.
+
+
+V0.5.5 - Nov 22, 2002
+- Added code overview at the top.
+- Show the size we're working on at each new link (if the size has
+  changed from last time).
+- Ignore blank md5sums.
+- Discard the md5sum of an inode if we perform the last unlink on that
+  inode.
+
+
+V0.5.4 - Nov 20, 2002
+- Slightly different array syntax, per Ross Carlson.
+
+
+V0.5.3 - Nov 17, 2002
+- Stop using :::: as a separator between the filenames in FilesOfInode;
+  make the FilesOfInode values real arrays.
+
+
+V0.5.2 - Nov 4, 2002
+- Discard solitary inodes early to (theoretically) save memory (note
+  that perl (5, at least) doesn't actually return memory to the OS if
+  the app undef's it).
+
+
+V0.5.1 - Nov 3, 2002
+- IndexFile function to load all arrays.
+- Load md5sum cache late.
+
+
+V0.5 - Nov 3, 2002
+- Freedups has been rewritten in perl.
+- First perl release with the following features:
+  - (shared) md5 checksum cache
+  - Read filenames in and stat them, storing inode info in internal arrays.
+  - Do comparison of _inodes_, not filenames.
+  - If a given size has a single inode, discard it as there's no chance of linking.
+
+
+V0.4 - May 6, 2001
+- v0.3 and below were spending a _lot_ of time forking basename, even
+  when we didn't need to test for basename. By removing that and
+  grouping files with identical md5sums together, it processes large
+  numbers of files in about a tenth of the time. It does need to
+  read all the files now, some twice, but it's worth it for the speedup.
+
+
+V0.3 - Mar 11, 2001
+- Handles command line parameters now. Setting options via environment
+  variables works for the moment, but will be removed in a future
+  version.
+- Updated documentation. List of apps, more verbose answers to
+  questions.
+- GPL text block added.
+- Other minor fixes and cleanups.
+
+V0.2.1 - Mar 02, 2001
+- Added README and Changelog to package.
+- Don't debug by default in shipping version.
+- Clean out more debugging code.
+- Minor code cleanups.
+- Add examples to Usage output.
+- Use mktemp if available for the temporary signature file.
+
+
+V0.2 - Feb 23, 2001
+- Removal of a lot of forks and simplification of tests.
+- More equivalency testing done in find's output.
+- Link to the older of the two files or the file with the most links.
+
+
+V0.1 - Feb 19, 2001
+- Basic search and link functionality.
+- Environment variables available:
+- ACTUALLYLINK=YES   #Just reports on potential savings if anything but YES.
+- VERBOSE=YES        #Show directory listing and wait before linking if YES.
+- CHECKDATE=YES      #Modified date and time must be equal to be considered for linking if YES.
+- FILENAMESEQUAL=YES #Files must have the same name (in different directories) to be considered for linking.
+- MINSIZE=size       #Files must be larger than this size (in bytes) to be considered for linking.
+
+
+- Not publicly released.
diff --git a/rpms/freedups/README b/rpms/freedups/README
new file mode 100644
index 0000000..5ac044c
--- /dev/null
+++ b/rpms/freedups/README
@@ -0,0 +1,273 @@
+ Freedups searches through the directories you specify. When it
+finds two identical files, it hard links them together. The two or
+more files still exist in their respective directories, but only one copy
+of the data is stored on disk; both directory entries point to the same
+data blocks.
+ This allows you to reclaim space on your drive. It's that
+simple. Run it every night from a cron job.
+
+
+Why you'd want to use it:
+ - You have multiple copies of a source code tree on your system.
+Freedups will link any identical files together and ignore any files
+that changed between versions.
+ - You have multiple copies of the file COPYING in /usr/doc or
+/usr/share/doc.
+ - Depending on your system, the following might be good places
+to try linking (the size in parentheses is the amount saved on a very basic
+RedHat 7.3 install; you'll probably get even more savings):
+freedups /lib/kbd (463K)
+freedups /usr/doc /usr/share/doc
+freedups /usr/src/linux*
+freedups /usr/src/pcmcia-cs*
+freedups /usr/share (8.6M)
+freedups /usr/lib (97K)
+freedups /usr/man /usr/share/man
+freedups /usr/share/locale /etc/locale (652K)
+freedups /usr/share/scrollkeeper /var/lib/scrollkeeper (719K)
+ - Directories holding files that are only read are good
+candidates.
+ You might also find some space savings by deleting the
+/usr/share/locale/country_code/LC_MESSAGES/*.mo files in country_codes
+you don't need.
+
+
+Things to watch out for:
+ - You'll need to use the _full path_, starting with /, to any
+files or directories you want freedups to search. If you don't, you'll
+likely get an error like "cannot stat file".
+ - Remember that you now have multiple directory entries pointing
+at one block of data. Depending on what editor you use, when you change
+one of the files you may be changing the others as well. See below for
+a list of applications and whether they automatically handle hardlinks
+or not.
+ - For the above reason, you probably don't want to create links
+to any backup copies on the drive.
+ - If the files are on different partitions, it's not possible to
+create a hardlink between them. Freedups handles this gracefully.
+ - Directories holding files that might be written to are
+generally not good candidates. Similarly, avoid directories holding
+security-related files. /etc is a bad choice on both counts.
+ - If you run freedups without the --datesequal=yes option,
+freedups may link files with different modification times together. If
+you later use "rpm -Va" (or the equivalent debian system verify
+command), it may report that the timestamps on some files have changed.
+If this is _all_ that has changed, this is a cosmetic problem only. For
+example, the following is cosmetic and not indicative of a modified
+file:
+
+.......T /usr/share/automake/COPYING
+
+
+Can I run this and just see what would have been linked together without
+modifying anything?
+ Sure. In fact, unless you put -a on the command line, that's
+_all_ freedups will do. By default, it won't actually do anything;
+it'll just tell you what the approximate space savings would be.
+
+
+Does this really save any space?
+ It really depends on whether you have duplicate files on your
+filesystem or not. I've personally recovered ~3G on my main drive from
+hardlinking identical files in the various kernel trees I have there.
+One user reports saving ~2G simply from hardlinking identical files
+downloaded by a p2p file sharing program.
+
+
+Does this slow down the system like the drive compression programs?
+ No. No files are compressed with this tool. It only instructs
+the filesystem to keep one copy of two or more identical files and have
+all their directory entries point at the sole copy of the actual file
+data. In fact, for certain operations (such as using diff between two
+freedup'd directory trees), the system runs much, much faster.
+ File reads should _not_ become slower.
+ Running freedups can take quite a while, but it can certainly
+be run off-hours or when the system is generally idle. It can be run
+under nice to give other programs priority.
+
+
+Do I have to run this as root?
+ Not at all. As long as you own the files, freedups runs just
+fine as a normal user.
+
+
+What has to be true for two files to get linked together?
+ - They have to be files (i.e. not character or block devices, no
+pipes, no directories, no symlinks).
+ - They have to have at least one byte. I don't want to link
+all 0 byte files on the system together.
+ - They have to have the same size.
+ - They have to have the same user owner, group owner and mode.
+Skirting this requirement would raise _serious_ security considerations.
+If you want to link two files that currently differ in owner or mode,
+use chown or chmod to make their owners or modes identical and re-run
+freedups.
+ - They have to be readable by the current user.
+ - The contents of the files have to be identical.
+ - Optionally (--minsize=1000), the files have to be larger than
+the given number of bytes.
+ - Optionally (--datesequal=yes), the files have to have identical
+modification timestamps.
+ - Optionally (--filenamesequal=yes), the filenames have to be
+identical (in different directories, obviously).
+ - They have to be on the same partition.
+ - That partition must support hardlinks. Ext2, ext3 and
+reiserfs do. I'm pretty sure fat/vfat/msdos do not. If you know whether
+another linux filesystem supports hardlinks or not, please let me know.
+
+
+I think I have a bunch of files that should be linked together, but
+freedups doesn't link them. Why not?
+ Walk through the above list of criteria for a given pair of
+files in question. Which one fails?
+ To examine a pair of files, look at the output from:
+
+ls -ali firstfile secondfile<Enter>
+
+ which looks like:
+
+2097229 -rw-rw-r--  1 wstearns wstearns  4 Mar 11 16:09 firstfile
+2097673 -rw-------  1 nobody   nobody    5 Mar 11 16:10 secondfile
+
+ The columns are: inode number, file mode, number of links to
+this inode, user owner, group owner, file size, modification date,
+modification time, and filename. The above two files wouldn't be linked
+because their modes are different, they're owned by different users,
+they're owned by different groups, and they have different sizes (so they
+must have different contents). Depending on options, they may also be
+disqualified because their modification times and filenames are
+different.
+ That said, if you do come up with files that legitimately should
+be linked but aren't, please email me so I can fix freedups.
+
+
+Can this be safely run more than once?
+ Definitely. Freedups is smart enough to recognize that two
+files are already linked together and just moves on to the next pair.
+ For this reason, running it twice on the exact same set of
+files won't save any more space.
+
+Are there different ways to do this?
+ Sure.
+ - Rewrite this in a more efficient language.
+ - When copying a directory tree, hard link the files during the
+copy:
+
+cp -av --link linux-2.1.anything.orig linux-2.1.anything
+
+ Many thanks to the Kernel FAQ and Janos Farkas for that trick.
+ - Delete truly unneeded files.
+ - Use CVS or BitKeeper; the latter, at least, can save
+substantial amounts of space.
+
+
+How can I test that the program is working?
+ Try the following:
+[wstearns@sparrow wstearns]$ cd /tmp
+[wstearns@sparrow /tmp]$ mkdir duptest
+[wstearns@sparrow /tmp]$ cd duptest
+[wstearns@sparrow duptest]$ echo Hi there. >test1
+[wstearns@sparrow duptest]$ cp -p test1 test2
+[wstearns@sparrow duptest]$ ls -ali test1 test2
+1885113 -rw-rw-r--  1 wstearns wstearns  10 Feb 28 00:55 test1
+1885114 -rw-rw-r--  1 wstearns wstearns  10 Feb 28 00:55 test2
+
+ Note the different inode numbers - the total space used by these
+two files is 20 bytes (actually 2 filesystem blocks, but that's a detail).
+
+[wstearns@sparrow duptest]$ freedups ./test1 ./test2
+Options chosen: None
+About to check for links in " ./test1 ./test2"
+10: Would have linked ./test2 and ./test1
+Total space would have saved: 10 (An overestimate if more than two files would have been linked together.)
+
+ By default, it just reports what the savings would have been.
+
+[wstearns@sparrow duptest]$ freedups -a ./test1 ./test2
+Options chosen: ActuallyLink
+About to check for links in " ./test1 ./test2"
+10 Linked ./test2 and ./test1
+Total space saved: 10 (Small risk of overcounting space saved if linked files have different times.)
+[wstearns@sparrow duptest]$ ls -ali test1 test2
+1885114 -rw-rw-r--  2 wstearns wstearns  10 Feb 28 00:55 test1
+1885114 -rw-rw-r--  2 wstearns wstearns  10 Feb 28 00:55 test2
+
+ Now both files share a single inode, so all but one copy is freed
+and the free space rises accordingly.
+ For more examples, run freedups with the "-h" help option.
+
+
+Application list
+ This list of applications shows whether they handle unlinking a
+file before saving to it. I made an attempt on each to find an option
+that allows one to change this behavior, but may not have found one.
+ Contributions and corrections are gratefully accepted.
+ Here's how to test:
+
+[wstearns@sparrow wstearns]$ cd /tmp
+[wstearns@sparrow /tmp]$ mkdir linktest
+[wstearns@sparrow /tmp]$ cd linktest
+[wstearns@sparrow linktest]$ echo Hi there >test1
+[wstearns@sparrow linktest]$ ln -f test1 test2
+[wstearns@sparrow linktest]$ ls -ali test*
+1885112 -rw-rw-r--  2 wstearns wstearns   9 Mar  5 12:52 test1
+1885112 -rw-rw-r--  2 wstearns wstearns   9 Mar  5 12:52 test2
+[wstearns@sparrow linktest]$ myprogram test1
+
+#Replace myprogram with the program under test.
+#In this program, add some characters to the file and save your changes.
+
+[wstearns@sparrow linktest]$ ls -ali test*
+1885112 -rw-rw-r--  2 wstearns wstearns  19 Mar  5 12:54 test1
+1885112 -rw-rw-r--  2 wstearns wstearns  19 Mar  5 12:54 test2
+
+ The fact that the two files still share an inode and both
+changed in content means that the link between test1 and test2 was
+preserved. If, instead, you get:
+
+[wstearns@sparrow linktest]$ ls -ali test*
+2236994 -rw-rw-r--  2 wstearns wstearns  19 Mar  5 12:54 test1
+1885112 -rw-rw-r--  2 wstearns wstearns   9 Mar  5 12:52 test2
+
+this means the program unlinked test1 before saving the changes.
+ Note that neither behavior is "correct"; it's just that you
+may prefer one over the other while working on a given file.
+
+Editor              Action on save   Notes
+abiword-0.7.11      preserves link
+bash-1.14.7's ">"   preserves link
+bash-1.14.7's ">>"  preserves link
+emacs-20.7          preserves link
+gedit-0.9.2         preserves link
+gnotepad+-1.3.1     preserves link   #When "write backup file" turned off
+gnotepad+-1.3.1     unlinks          #When "write backup file" turned on
+gnumeric-0.58       preserves link
+gxedit-1.23         preserves link
+jove-4.16.0.24      preserves link
+kedit-1.1.2         preserves link   #When "Backup Copies" turned off
+kedit-1.1.2         unlinks          #When "Backup Copies" turned on
+lyx-0.12.0          preserves link
+mcedit-4.5.51       preserves link   #~/.mc/ini: editor_option_save_mode=0 (Save mode=quick save)
+mcedit-4.5.51       unlinks          #~/.mc/ini: editor_option_save_mode=1 (Save mode=safe save)
+netscape-4.76       unlinks          #Editor in netscape-communicator
+nedit-5.1.1         preserves link
+patch-2.5.4         unlinks
+rpm-4.0             unlinks          #on "-U" upgrade, at least.
+rsync-2.3.2         unlinks          #on server, hardlink is unlinked when a new version sent
+vim-5.1             preserves link
+wordperfect-7.0     preserves link   #"Original document backup" has no effect; always preserves link.
+xedit-3.3.2         preserves link
+
+
+Contacts and credits.
+ Please send comments, suggestions, bug reports, patches, and/or
+additions to the filesystem or applications list to William Stearns
+<wstearns@pobox.com>.
+ Many thanks to Kevin Burton for his constructive suggestions,
+most of which made it into v0.3.0. Sorry, Kevin, it's still written in
+bash. :-)
+
+
diff --git a/rpms/freedups/freedups-0.6.14-0.noarch.rpm b/rpms/freedups/freedups-0.6.14-0.noarch.rpm
new file mode 100644
index 0000000..b53af48
Binary files /dev/null and b/rpms/freedups/freedups-0.6.14-0.noarch.rpm differ
diff --git a/rpms/freedups/freedups-0.6.14-0.src.rpm b/rpms/freedups/freedups-0.6.14-0.src.rpm
new file mode 100644
index 0000000..7395a63
Binary files /dev/null and b/rpms/freedups/freedups-0.6.14-0.src.rpm differ
diff --git a/rpms/freedups/freedups-0.6.14.spec b/rpms/freedups/freedups-0.6.14.spec
new file mode 100644
index 0000000..884f1e7
--- /dev/null
+++ b/rpms/freedups/freedups-0.6.14.spec
@@ -0,0 +1,81 @@
+%define version 0.6.14
+Name: freedups
+Summary: Hardlinks identical files to save space.
+Version: %{version}
+Release: 0
+Copyright: GPL
+Packager: William Stearns <wstearns@pobox.com>
+Group: Applications/File
+Source: ftp://ftp.stearns.org/pub/freedups-%{version}.tar.gz
+Prereq: perl perl(File::Find) perl(File::Compare) perl(File::Basename) perl(File::stat) perl(Digest::MD5) perl(IO::File) perl(Getopt::Long)
+Buildarch: noarch
+Vendor: William Stearns <wstearns@pobox.com>
+URL: http://www.stearns.org/freedups/
+BuildRoot: /tmp/freedups-broot
+
+
+%description
+Freedups hardlinks identical files to save space. For files that are
+generally read from and not written to, this can provide a
+significant space savings with no performance degradation. In fact,
+in a small number of cases, this can speed up the system.
+
+
+%prep
+%setup
+
+
+%install
+if [ "$RPM_BUILD_ROOT" = "/tmp/freedups-broot" ]; then
+    rm -rf $RPM_BUILD_ROOT
+
+    install -d $RPM_BUILD_ROOT/usr/bin
+    cp -p freedups.pl $RPM_BUILD_ROOT/usr/bin/freedups
+else
+    echo Invalid Build root
+    exit 1
+fi
+
+
+%clean
+if [ "$RPM_BUILD_ROOT" = "/tmp/freedups-broot" ]; then
+    rm -rf $RPM_BUILD_ROOT
+else
+    echo Invalid Build root
+    exit 1
+fi
+
+
+%files
+%defattr(-,root,root)
+%attr(755,root,root) /usr/bin/freedups
+%doc README ChangeLog
+
+
+%changelog
+* Sun Mar 14 2003 William Stearns <wstearns@pobox.com>
+- Updated source to 0.6.14
+
+* Mon Dec 9 2002 William Stearns <wstearns@pobox.com>
+- Updated source to 0.6.0
+
+* Mon Dec 9 2002 William Stearns <wstearns@pobox.com>
+- Updated source to 0.5.9
+
+* Sat Nov 30 2002 William Stearns <wstearns@pobox.com>
+- Updated source to 0.5.8
+
+* Thu Nov 28 2002 William Stearns <wstearns@pobox.com>
+- Updated source to 0.5.7, switch over to perl
+
+* Wed May 09 2001 William Stearns <wstearns@pobox.com>
+- Updated source to 0.4, md5sum prereq.
+
+* Sun Mar 11 2001 William Stearns <wstearns@pobox.com>
+- Updated source to 0.3, updated prerequisites.
+
+* Fri Mar 02 2001 William Stearns <wstearns@pobox.com>
+- Updated source to 0.2.1, minor specfile updates.
+
+* Tue Feb 27 2001 William Stearns <wstearns@pobox.com>
+- v0.2 First beta release for comment.
diff --git a/rpms/freedups/freedups-0.6.14.tar.gz b/rpms/freedups/freedups-0.6.14.tar.gz
new file mode 100644
index 0000000..cdbc6fd
Binary files /dev/null and b/rpms/freedups/freedups-0.6.14.tar.gz differ
diff --git a/rpms/freedups/freedups-v0.6.14.pl b/rpms/freedups/freedups-v0.6.14.pl
new file mode 100644
index 0000000..5ceb1ce
--- /dev/null
+++ b/rpms/freedups/freedups-v0.6.14.pl
@@ -0,0 +1,800 @@
+#!/usr/bin/perl
+#Copyright 2002, William Stearns <wstearns@pobox.com>
+#Released under the GPL.
+
+#Code overview:
+#
+# This program is given a set of directories to scan. It uses the
+#perl equivalent of find to stat all files in those directories; these
+#files are loaded into the %InodesOfSize, %InodeOfFile, and
+#%FilesOfInode hashes.
+#
+# We now need to find Inodes that might be linkable with each
+#other. For a given size, if we only know of one inode that large, we
+#can immediately forget about it since there's nothing we could possibly
+#link it to (solitary inodes). We then walk through the remaining
+#inodes, starting with the largest. For every pair of same-sized inodes,
+#we check to see if that pair is linkable; if so, we pick one inode to
+#stay as is (the more sparse, older, or already more heavily linked
+#inode), and hardlink all filenames associated with the other inode to
+#it.
+
+#FIXME - race where file replaced long after stat.
+#FIXME - on ctrl-C perhaps write out cache.
+#FIXME - Progress headers
+#FIXME - support minsize and maxsize range
+#FIXME - careful walk through bash version to compare.
+#FIXME - print stats on which criteria used to decide who links to whom
+#FIXME - check all variable uses to make sure we're not printing packed data
+#FIXME - This app currently requires full paths - make it more gracefully handle relative?
+#FIXME - check debugs
+#FIXME - hand down filename to IndexFile, don't use basename
+#FIXME - reduce number of stats?
+
+use strict;
+use File::Find ();
+use File::Compare;
+use File::Basename;
+
+use File::stat;
+use Digest::MD5;
+use IO::File;
+
+use Getopt::Long;
+
+use POSIX qw(getcwd);
+
+use vars qw/*name *dir *prune/;
+*name = *File::Find::name;
+*dir = *File::Find::dir;
+*prune = *File::Find::prune;
+
+use constant MD5SUM_MIN_SAVE_SIZE => 8193;  #Only files this size and larger will have their md5sums saved to the cache file. Any non-empty file will have its checksum saved in ram during a run; this only affects the save at the end of a run.
+
+my $FreedupsVer="0.6.14";
+
+my %KnownMd5sums = ( );  #Key is packed device, inode, crucial characteristics; value is the md5sum of that inode. The on-disk cache is loaded into this and saved from this.
+my %NewMd5sums = ( );    #Known holds the sums loaded from the cache; these obviously don't need to be rewritten back out. New holds new ones that need to be written out at the end.
+
+#The following hashes store info about files in the requested directory trees only; there could be others in the filesystem that freedups wasn't asked to index.
+my %InodesOfSize;  #All the inodes of the size key. This is a hash whose values are arrays.
+my %InodeOfFile;   #Provides the InodeSpec of the Filename key (this is loaded very late now; it only holds the files and InodeSpecs for the files of the size currently being processed).
+my %FilesOfInode;  #The filenames associated with the inode key. This is a hash whose values are arrays.
+
+my @PathIndex;
+my @FileOf;
+my @Paths;
+
+my $CurrentIndex = 0;
+
+my $NumSpecs = 0;       #Number of command line directory/file specs in which to search for candidate files.
+my $CachedSums = 0;     #How many checksums we were able to pull from cache.
+my $FromDiskSums = 0;   #How many checksums we had to pull from media.
+my $SpaceSaved = 0;     #How many bytes saved. Takes into account whether we're removing the last link to the file or other links exist.
+my $EstimatedSpaceSaved = 0;  #How much space we would have saved if -a had been on.
+#my $SolitaryInodeSizes = 0;  #Sizes for which there was only one inode. #We no longer know this.
+#my $MultipleInodeSizes = 0;  #Sizes for which there was more than one inode. #We no longer know this either.
+my $UniqueFilesScanned = 0;   #How many unique filenames we inspected.
+my $UniqueInodesScanned = 0;  #How many unique inodes encountered during the initial scan.
+my $LastSizeLinked = 0;       #For progress, what's the last linked inode size.
+my $InternalChecks = 0;       #Used during development for additional internal checks.
+my $DroppedFilenames = 0;     #How many filenames were discarded because there were already $MaxFiles filenames for that inode.
+my $DiscardedSmallSums = 0;   #Checksums we threw away at the save step because they were too small.
+
+#User options:
+#1 (true) or 0 (false)
+my $ActuallyLink = 0;  #Actually link the files. Otherwise, just report on potential savings and preload the md5sum cache.
+my $CacheFile = "$ENV{HOME}/md5sum-v1.cache";  #Private file that holds the inode=md5sum cache between runs. Must be created before the program runs.
+my $DatesEqual = 0;      #Only link files if their modification times are identical.
+my $FileNamesEqual = 0;  #Require that the two (pathless) filenames be equal before linking.
+my $Help = 0;            #Show help.
+my $MaxFiles = 0;        #Maximum files to remember for a given inode; reduce to save ram.
+my $MinSize = 0;         #Files _larger_ than this many bytes are considered for linking. 0 byte files are _never_ considered.
+#I've found at least one program bug (don't use == with alphanumeric md5sums) with Paranoid. I recommend leaving it on for now.
+my $Paranoid = 1;        #Set to 1 to force a strict compare just before linking.
+my $Verbosity = 1;       #0 = Just intro and stats, 1 = Normal, 2 = some debugging, 3 = Show me everything! #FIXME - prompts have not been strictly checked.
+
+
+
+#sub RealName {
+#    my $FileIndex = shift;
+#    return "$Paths[$PathIndex[$FileIndex]]/$FileOf[$FileIndex]";
+#}
+
+
+sub Debug {
+    my $DebugLevel = shift;
+    if ($Verbosity >= $DebugLevel) {
+        my $DebugString = shift;
+        print "$DebugString\n";
+    }
+} #End sub Debug
+
+
+#FIXME - new storage format
+sub LoadSums {
+#Load %KnownMd5sums from the cache file
+    my $cache_filename = shift;
+
+    undef $!;
+    if (my $cache_read_fh = IO::File->new($cache_filename, O_RDONLY)) {  # |O_CREAT not used, security risk
+        my $loaded_pairs = 0;
+        undef $!;
+
+        while (defined(my $cache_line = <$cache_read_fh>)) {  #process one cache entry from the local file.
+            chomp $cache_line;
+            my ($cache_inodespec, $cache_md5sum) = split(/=/, $cache_line, 2);
+            #Debug 4, "Read $cache_inodespec,$cache_md5sum.";
+            if ($cache_md5sum ne '') {
+                my ($Tdev, $Tino, $Tmode, $Tuid, $Tgid, $Tsize, $Tmtime, $Tctime) = split(/\//, $cache_inodespec);
+                $KnownMd5sums{pack("SLSSSLLL",$Tdev,$Tino,$Tmode,$Tuid,$Tgid,$Tsize,$Tmtime,$Tctime)} = pack("H32", $cache_md5sum);
+                $loaded_pairs++;
+            #} else {
+            #    Debug 3, "Blank md5sum read for $cache_inodespec.";
+            }
+        }
+        close $cache_read_fh;
+        #Debug 2, "Initial load: loaded $loaded_pairs cached KnownMd5sums from $cache_filename.";
+    } else {  #Warn about a missing or unreadable cache file.
+        Debug 1, "Local cache file $cache_filename unavailable or unreadable. Create it (with 'touch $cache_filename' perhaps) if it's not there and check permissions, please: $!";
+    }  #End of load cache file entries
+} #End sub LoadSums
+
+
+#FIXME - new storage format
+sub SaveSums {
+#Save %NewMd5sums to the cache file
+    my $cache_filename = shift;
+    if (my $cache_write_fh = IO::File->new("$cache_filename", O_WRONLY|O_APPEND)) {  # |O_CREAT not used, security risk.
+        foreach my $key (keys %NewMd5sums) {
+            if ($NewMd5sums{$key} ne '') {
+                my ($Tdev, $Tino, $Tmode, $Tuid, $Tgid, $Tsize, $Tmtime, $Tctime) = unpack("SLSSSLLL", $key);
+                if ($Tsize >= MD5SUM_MIN_SAVE_SIZE) {
+                    print $cache_write_fh "$Tdev/$Tino/$Tmode/$Tuid/$Tgid/$Tsize/$Tmtime/$Tctime=", unpack("H32", $NewMd5sums{$key}), "\n";
+                } else {
+                    $DiscardedSmallSums++;
+                }
+            #} else {
+            #    Debug 3, "Blank md5sum not written for $key.";
+            }
+        }
+        close $cache_write_fh;
+    } else {  #Warn about a missing or unwritable cache file.
+        Debug 1, "Local cache file $cache_filename unavailable or unwritable for storing new entries. Create it (with 'touch $cache_filename' perhaps) if it's not there and check permissions, please: $!.";
+    }
+} #End sub SaveSums
+
+
+sub MD5SumOf {
+#This returns the md5sum of a given file.
+#stat(file) returns the Inode info we need to pull the cached md5sum
+#from %KnownMd5sums, or we pull the checksum from disk and save it in
+#%KnownMd5sums and %NewMd5sums (and hence, in the md5sum cache file)
+#for future use.
+    my $SumIndex = shift or die "No file specified in MD5SumOf: $!";
+    my $SumFile = "$Paths[$PathIndex[$SumIndex]]/$FileOf[$SumIndex]";
+
+    my $InodeSpec;
+
+    if (! defined($InodeOfFile{$SumIndex})) {
+        my $sb = stat($SumFile);
+        $InodeOfFile{$SumIndex}=pack("SLSSSLLL",$sb->dev, $sb->ino, $sb->mode, $sb->uid, $sb->gid, $sb->size, $sb->mtime, $sb->ctime);
+        if ($InternalChecks) {
+            my ($Tdev, $Tino, $Tmode, $Tuid, $Tgid, $Tsize, $Tmtime, $Tctime) = unpack("SLSSSLLL", $InodeOfFile{$SumIndex});
+            die $sb->dev . " ne " . $Tdev . ", exiting" if ($sb->dev ne $Tdev);
+            die $sb->ino . " ne " . $Tino . ", exiting" if ($sb->ino ne $Tino);
+            die $sb->mode . " ne " . $Tmode . ", exiting" if ($sb->mode ne $Tmode);
+            die $sb->uid . " ne " . $Tuid . ", exiting" if ($sb->uid ne $Tuid);
+            die $sb->gid . " ne " . $Tgid . ", exiting" if ($sb->gid ne $Tgid);
+            die $sb->size . " ne " . $Tsize . ", exiting" if ($sb->size ne $Tsize);
+            die $sb->mtime . " ne " . $Tmtime . ", exiting" if ($sb->mtime ne $Tmtime);
+            die $sb->ctime . " ne " . $Tctime . ", exiting" if ($sb->ctime ne $Tctime);
+        }
+    }
+    $InodeSpec = $InodeOfFile{$SumIndex};
+
+    #if (defined($NewMd5sums{$InodeSpec})) {
+    #    $CachedSums++;
+    #    ####$SumToReturn = $NewMd5sums{$InodeSpec};
+    #    Debug 3, "Checksum came from new cache.";  #Add in ...unpack("SLSSSLLL", $InodeSpec)... perhaps later
+    #} els
+    if (defined($KnownMd5sums{$InodeSpec})) {
+        $CachedSums++;
+        #Debug 3, "Checksum came from known cache.";  #Add in ...unpack("SLSSSLLL", $InodeSpec)... perhaps later
+    } else {
+        $FromDiskSums++;
+        open(FILE, $SumFile) or die "Can't open '$SumFile': $!";
+        binmode(FILE);
+        #$KnownMd5sums{$InodeSpec} = Digest::MD5->new->addfile(*FILE)->hexdigest;
+        my $TempSum = Digest::MD5->new->addfile(*FILE)->hexdigest;
+        #FIXME Chained ... = ... = ...?
+        $KnownMd5sums{$InodeSpec} = pack("H32", $TempSum);
+        $NewMd5sums{$InodeSpec} = pack("H32", $TempSum);
+        if ($InternalChecks) {
+            if (unpack("H32", $KnownMd5sums{$InodeSpec}) ne $TempSum) {
+                die "Unpack failure " . $KnownMd5sums{$InodeSpec} . " ne " . $TempSum . ", exiting.";
+            }
+        }
+        #Debug 3, "Checksum came from physical disk.";  #add in ...unpack("SLSSSLLL", $InodeSpec)... perhaps later
+        #Do I need to explicitly close here?
+    }
+    #Debug 3, "File: $SumFile, Sum: " . unpack("H32", $KnownMd5sums{$InodeSpec}) . ".";
+    return $KnownMd5sums{$InodeSpec};
+} #End sub MD5SumOf
+
+
+sub IndexFile {
+    $CurrentIndex++;
+
+    my $OneFile = shift;
+    $FileOf[$CurrentIndex] = shift;
+
+    my $OnePath = dirname($OneFile);
+
+    if ($OnePath eq $Paths[$#Paths]) {
+        #print '!';
+        $PathIndex[$CurrentIndex] = $#Paths;
+    } else {
+        $PathIndex[$CurrentIndex] = $#Paths + 1;
+        $Paths[$#Paths + 1] = $OnePath;
+    }
+    #Because perl's find function appears to walk the directories sequentially, there's no point in looking at any but the last
+    #directory in the stack.
+    #} else {
+    #    #print '^';
+    #    foreach my $PathWalk (reverse (1..$#Paths)) {
+    #        #print "ZZ Comparing $OnePath to $PathWalk : $Paths[$PathWalk].\n";
+    #        if ($OnePath eq $Paths[$PathWalk]) {
+    #            #print '.';  #Never get here; we never go back to an old directory.
+    #            #print "ZZ found path at $PathWalk.\n";
+    #            $PathIndex[$CurrentIndex] = $PathWalk;
+    #            $Done = 1;
+    #            last;
+    #        }
+    #    }
+    #}
+
+    #This was the simple approach that just made a new path entry for each file.
+    #$Paths[$CurrentIndex] = $OnePath;
+    #$PathIndex[$CurrentIndex] = $CurrentIndex;
+
+    #print "ZZ $OneFile ZZ $OnePath ZZ $FileOf[$CurrentIndex] YY $Paths[$PathIndex[$CurrentIndex]]/$FileOf[$CurrentIndex]\n";
+    #return 0;
+
+    my $sb = stat($OneFile);
+    if (defined $sb) {
+        my $InodeSpec=pack("SLSSSLLL",$sb->dev, $sb->ino, $sb->mode, $sb->uid, $sb->gid, $sb->size, $sb->mtime, $sb->ctime);
+        #Quick note - InodesOfSize _used_ to be all the inodes of a given size; now it's all the inodes of a given equivalence class.
+        my $EquivClass;
+        if ($DatesEqual) {
+            $EquivClass=pack("LSSSL",$sb->size, $sb->uid, $sb->gid, $sb->mode, $sb->mtime);
+        } else {
+            $EquivClass=pack("LSSS",$sb->size, $sb->uid, $sb->gid, $sb->mode);
+        }
+        if ($InternalChecks) {
+            my ($Tdev, $Tino, $Tmode, $Tuid, $Tgid, $Tsize, $Tmtime, $Tctime) = unpack("SLSSSLLL", $InodeSpec);
+            die $sb->dev . " dev ne " . $Tdev . ", exiting" if ($sb->dev ne $Tdev);
+            die $sb->ino . " ino ne " . $Tino . ", exiting" if ($sb->ino ne $Tino);
+            die $sb->mode . " mode ne " . $Tmode . ", exiting" if ($sb->mode ne $Tmode);
+            die $sb->uid . " uid ne " . $Tuid . ", exiting" if ($sb->uid ne $Tuid);
+            die $sb->gid . " gid ne " . $Tgid . ", exiting" if ($sb->gid ne $Tgid);
+            die $sb->size . " size ne " . $Tsize . ", exiting" if ($sb->size ne $Tsize);
+            die $sb->mtime . " mtime ne " . $Tmtime . ", exiting" if ($sb->mtime ne $Tmtime);
+            die $sb->ctime . " ctime ne " . $Tctime . ", exiting" if ($sb->ctime ne $Tctime);
+        }
+
+        #Check uniqueness by scanning FilesOfInode
+        my $FileAlreadyInFOI = 0;  #False
+        foreach my $ExistingIndex (@{$FilesOfInode{$InodeSpec}}) {
+            if ($OneFile eq "$Paths[$PathIndex[$ExistingIndex]]/$FileOf[$ExistingIndex]") {
+                $FileAlreadyInFOI = 1;
+                last;  #exit foreach now
+            }
+        }
+        if (! $FileAlreadyInFOI) {
+            #Debug 3, "    Adding $OneFile to FilesOfInode.";
+
+            $UniqueFilesScanned++;
+            #if (defined($FilesOfInode{$InodeSpec})) {
+
+#FIXME (3x) handle and allow maxfiles=0
+            if ($#{$FilesOfInode{$InodeSpec}} >= $MaxFiles) {
+                $DroppedFilenames++;
+            } else {
+                push @{$FilesOfInode{$InodeSpec}}, $CurrentIndex;
+            }
+
+            #} else {
+            #    $FilesOfInode{$InodeSpec} = [ $CurrentIndex ];
+            #}
+
+            if (defined($InodesOfSize{$EquivClass})) {
+                #Check to see if $InodeSpec is already in $InodesOfSize{$EquivClass}
+                my $InodeAlreadyInIOS = 0;  #False
+                foreach my $OneInodeSpec (@{$InodesOfSize{$EquivClass}}) {
+                    if ($OneInodeSpec eq $InodeSpec) {
+                        $InodeAlreadyInIOS = 1;
+                        last;  #exit foreach loop now.
+                    }
+                }
+                if (! $InodeAlreadyInIOS) {
+                    $UniqueInodesScanned++;
+                    #Debug 3, "    Adding $InodeSpec to InodesOfSize";
+                    #Old approach - next line - was to just add the new inode to InodesOfSize and come back to scan InodesOfSize later.
+                    #push @{$InodesOfSize{$EquivClass}}, $InodeSpec;
+
+                    #Old approach recreated the @{$InodesOfSize{$EquivClass}} array on every inode.
+                    #Now we'll just walk through it with a counter.
+                    #my @CurrentInodes = @{$InodesOfSize{$EquivClass}};
+                    #@{$InodesOfSize{$EquivClass}} = ( );  #Start with a fresh list; we'll pull in the inodes one by one. [ ] instead?
+
+                    my $Done = 0;
+                    #Now we compare InodeSpec to each of the existing InodesOfSize right now (stopping when we find a match) to see if there's a match.
+                    foreach my $InIndex (0..$#{$InodesOfSize{$EquivClass}}) {
+                        #my $OneExistingInode = @{$InodesOfSize{$EquivClass}}[$InIndex];
+                        my $DidWeLink = CheckForLinkableInodes(@{$InodesOfSize{$EquivClass}}[$InIndex], $InodeSpec);
+                        if ($DidWeLink == 0) {
+                            #No link was performed. Keep the Existing Inode in the list; keep going to see if any of the others match.
+                            #push @{$InodesOfSize{$EquivClass}}, $OneExistingInode;
+                        } elsif ($DidWeLink == 1) {
+                            #Left inode kept, right inode linked to it. Keep the Existing Inode in the list; stop searching.
+                            #push @{$InodesOfSize{$EquivClass}}, $OneExistingInode;
+                            $Done = 1;
+                            last;
+                        } elsif ($DidWeLink == 2) {
+                            #Right inode kept, left inode was linked to it. Put the new Inode in the list; stop searching.
+                            @{$InodesOfSize{$EquivClass}}[$InIndex] = $InodeSpec;
+                            #push @{$InodesOfSize{$EquivClass}}, $InodeSpec;
+                            $Done = 1;
+                            last;
+                        } else {
+                            die "Unexpected return value ($DidWeLink) from CheckForLinkableInodes";
+                        }
+                    }
+                    #OK, we compared $InodeSpec to all the existing inodes of that size; no match.
+                    #We need to add this to the list so that we can compare it to future inodes.
+                    #Bill completely failed to notice this oversight; my _sincere_ thanks to
+                    #Martin Sheppard for noticing the problem, debugging it, and sending in a patch.
+                    if (!$Done) {
+                        push @{$InodesOfSize{$EquivClass}}, $InodeSpec;
+                    }
+                }
+            } else {
+                $UniqueInodesScanned++;
+                #Debug 4, "    Initial add $InodeSpec to InodesOfSize";
+                $InodesOfSize{$EquivClass} = [ $InodeSpec ];
+            }
+        }
+    } else {
+        Debug 2, "Can't stat $OneFile: $!";
+    }
+} #End sub IndexFile
+
+
+#Function provided by find2perl ... -type f -a -size +3366c
+sub wanted {
+    my ($dev,$ino,$mode,$nlink,$uid,$gid);
+
+    (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
+    -f _ &&
+    (int(-s _) > $MinSize) &&
+    IndexFile $File::Find::name, $_;
+    #IndexFile getcwd, $_;
+    #$File::Find::name is the full path and file name
+    #$File::Find::dir is the path to the file, but it's not a complete path
+    #$_ is the current filename
+} #End sub wanted
+
+
+sub LinkFiles {
+    #FIXME - this modifies %FilesOfInode and %InodeOfFile. Check to see if upper layers care. (partially checked)
+    #clean up? %md5sums{InodeSpec} - remove inode entry on last unlink. (DONE)
+    #clean up? %InodeOfFile{Filename} - Reset to new inode after hardlink and reinstate Identical Inode warning. (DONE)
+    #clean up? %FilesOfInode{InodeSpec} - Move file from old inode to new. (DONE)
+    my $BaseIndex = shift;  #Index of the file which will stay as is
+    my $LinkIndex = shift;  #Index of the file which will end up as a link to BaseFile
+
+    my $BaseFile = "$Paths[$PathIndex[$BaseIndex]]/$FileOf[$BaseIndex]";  #Filename of the file which will stay as is
+    my $LinkName = "$Paths[$PathIndex[$LinkIndex]]/$FileOf[$LinkIndex]";  #Filename which will end up as a link to BaseFile
+
+    if ( ($FileNamesEqual) && ($FileOf[$BaseIndex] ne $FileOf[$LinkIndex]) ) {
+        #Debug 3, "$BaseFile and $LinkName have different filenames at the end, not linking.";
+        return;
+    }
+
+    my $TempSsb=stat($LinkName);
+    my $EquivClass;
+    if ($DatesEqual) {
+        $EquivClass=pack("LSSSL",$TempSsb->size, $TempSsb->uid, $TempSsb->gid, $TempSsb->mode, $TempSsb->mtime);
+    } else {
+        $EquivClass=pack("LSSS",$TempSsb->size, $TempSsb->uid, $TempSsb->gid, $TempSsb->mode);
+    }
+    #FIXME - move this and the clear at the end outside of the loop in LinkInodes?
+    #Here we load $InodeOfFile{} with _just_ the files/inodes of the current size, pulling from FilesOfInode{InodesOfSize}.
+    #I think I won't clear it at the moment.
+    #%InodeOfFile = ( );
+    foreach my $InodeToIndex (@{$InodesOfSize{$EquivClass}}) {
+        foreach my $FileToIndex (@{$FilesOfInode{$InodeToIndex}}) {
+            $InodeOfFile{$FileToIndex} = $InodeToIndex;
+        }
+    }
+    if ($Paranoid) {
+        #Check that the file hasn't been modified since it was stat'd right after find. I'm aborting the program if changes occur; that tends to point
+        #to a file that's actively being modified. This shouldn't happen.
+        #Note that the following are duplicate checks; the file has already passed these once. Failing now means that file(s) is/are actively being
+        #changed under us.
+        my $Fsb=stat($BaseFile);
+        my $Ssb=stat($LinkName);
+        if (!(-e "$BaseFile")) {
+            die ("LinkFile: $BaseFile no longer exists or is not a file anymore. Exiting.");
+        }
+        if (!(-e "$LinkName")) {
+            die ("LinkFile: $LinkName no longer exists or is not a file anymore. Exiting.");
+        }
+        if ( ! ( ($Fsb->mode == $Ssb->mode) && ($Fsb->uid == $Ssb->uid) && ($Fsb->gid == $Ssb->gid) && ($Fsb->size == $Ssb->size) ) ) {
+            Debug 0, "File1: $InodeOfFile{$BaseIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$BaseIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$BaseIndex}}.";
+            Debug 0, "File2: $InodeOfFile{$LinkIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$LinkIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$LinkIndex}}.";
+            die ("LinkFile: paranoid stat checks failed! Please check failure in linking $BaseFile and $LinkName. Exiting.");
+        }
+        if (compare("$BaseFile","$LinkName") != 0) {  #Byte-for-byte compare not equal
+            Debug 0, "File1: $InodeOfFile{$BaseIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$BaseIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$BaseIndex}}.";
+            Debug 0, "File2: $InodeOfFile{$LinkIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$LinkIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$LinkIndex}}.";
+            die ("LinkFile: paranoid file comparison failed for $BaseFile and $LinkName, please check why. Exiting.");
+        }
+        $Fsb=stat($BaseFile);  #Refresh the stat blocks in case either changed during the file compare.
+        $Ssb=stat($LinkName);
+        if ( ! ( ($Fsb->mode == $Ssb->mode) && ($Fsb->uid == $Ssb->uid) && ($Fsb->gid == $Ssb->gid) && ($Fsb->size == $Ssb->size) ) ) {
+            Debug 0, "File1: $InodeOfFile{$BaseIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$BaseIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$BaseIndex}}.";
+            Debug 0, "File2: $InodeOfFile{$LinkIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$LinkIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$LinkIndex}}.";
+            die ("LinkFile: Second paranoid stat checks failed! Please check failure in linking $BaseFile and $LinkName. Exiting.");
+        }
+        #If the user asked to check mtime and the timestamps are not equal, something's wrong
+        if ( ($DatesEqual) && ($Fsb->mtime != $Ssb->mtime) ) {
+            Debug 0, "File1: $InodeOfFile{$BaseIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$BaseIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$BaseIndex}}.";
+            Debug 0, "File2: $InodeOfFile{$LinkIndex}, Associated files: @{$FilesOfInode{$InodeOfFile{$LinkIndex}}}, md5sum: $KnownMd5sums{$InodeOfFile{$LinkIndex}}.";
+            die ("LinkFile: mtime paranoid check failed! Please check failure in linking $BaseFile and $LinkName. Exiting.");
+        }
+        #Debug 2, "    Paranoid checks passed for $BaseFile and $LinkName.";
+    }
+
+    #Actually link and check the return code
+    if ($ActuallyLink) {
+        my $Ssb=stat($LinkName);  #Have to grab stat before link or else you're looking at nlinks of the merged inode.
+        #FIXME - how to handle the case where LinkName was unlinked but the link call fails?
+        if (unlink($LinkName) && link($BaseFile,$LinkName)) {
+            Debug 1, "    linked $BaseFile $LinkName";
+            if ($Ssb->nlink == 1) {
+                #undef the LinkName md5sum here; we just removed the last dentry pointing at the file.
+                undef $KnownMd5sums{$InodeOfFile{$LinkIndex}};
+                undef $NewMd5sums{$InodeOfFile{$LinkIndex}};
+                $SpaceSaved += $Ssb->size;
+            }
+            #Add $LinkIndex to the list of files on the same inode as $BaseIndex
+            if ($#{$FilesOfInode{$InodeOfFile{$BaseIndex}}} >= $MaxFiles) {
+                $DroppedFilenames++;
+            } else {
+                push @{$FilesOfInode{$InodeOfFile{$BaseIndex}}}, $LinkIndex;
+            }
+
+#FIXME Should/could we strip the ProcessInodesOfSize::$Tail directly instead? Not sure it would influence the link.
+#Perhaps ProcessInodesOfSize could use a hand-crafted walk through an array (that this routine could modify directly) instead?
+            #Strip $LinkIndex from $FilesOfInode{$InodeOfFile{$LinkIndex}}
+            my @TempFiles = @{$FilesOfInode{$InodeOfFile{$LinkIndex}}};
+            if ($#TempFiles == -1) {
+                die "Empty FOI-IOF array for $LinkIndex, $InodeOfFile{$LinkIndex}, shouldn't happen.";
+            } elsif ($#TempFiles == 0) {
+                if ( ($Paranoid) && ($FilesOfInode{$InodeOfFile{$LinkIndex}}[0] ne $LinkIndex) ) {
+                    die "Single Element list $FilesOfInode{$InodeOfFile{$LinkIndex}}[0] doesn't match $LinkIndex.";
+                }
+                #Only a single element, undef it
+                undef $FilesOfInode{$InodeOfFile{$LinkIndex}};
+            } else {
+                #At least 2 array elements
+                undef $FilesOfInode{$InodeOfFile{$LinkIndex}};  #Start fresh
+                foreach my $AFileIndex (@TempFiles) {
+                    if ($AFileIndex ne $LinkIndex) {
+                        #if (defined($FilesOfInode{$InodeOfFile{$LinkIndex}})) {
+                        push @{$FilesOfInode{$InodeOfFile{$LinkIndex}}}, $AFileIndex;
+                        #} else {
+                        #    $FilesOfInode{$InodeOfFile{$LinkIndex}} = [ $AFileIndex ];
+                        #}
+                    }
+                }
+            }
+
+            #Setting the correct Inode for this file must come after the above.
+            $InodeOfFile{$LinkIndex}=$InodeOfFile{$BaseIndex};
+        } else {
+            Debug 1, "    Failed to link $BaseFile $LinkName";
+        }
+    } else {
+        Debug 1, "    Would have linked $BaseFile $LinkName";
+    }
+
+    ##Clear InodeOfFile entirely as we're done with this file size for the moment.
+    %InodeOfFile = ( );
+} #End sub LinkFiles
+
+
+sub LinkInodes {
+    my $FirstInode = shift;
+    my $SecondInode = shift;
+    my $PreferredInode;  #Also the return value for this function. 1 means the left-hand inode was kept and the
+        #right linked to it; 2 means the right was kept and the left linked to it. Whether no links, some links,
+        #or all links are performed, we just need to return which inode is preferred.
+
+    my @FirstFileindexes = @{$FilesOfInode{$FirstInode}};
+    my @SecondFileindexes = @{$FilesOfInode{$SecondInode}};
+
+    my $Fsb=stat("$Paths[$PathIndex[$FirstFileindexes[0]]]/$FileOf[$FirstFileindexes[0]]");
+    my $Ssb=stat("$Paths[$PathIndex[$SecondFileindexes[0]]]/$FileOf[$SecondFileindexes[0]]");
+
+    if (! defined($FirstFileindexes[0])) {
+        #Debug 3, "No fileindexes for $FirstInode, why?";
+        return;
+    }
+    if (! defined($SecondFileindexes[0])) {
+        #Debug 3, "No fileindexes for $SecondInode, why?";
+        return;
+    }
+
+    #Show progress with a file size display
+    if ($LastSizeLinked != $Fsb->size) {
+        $LastSizeLinked = $Fsb->size;
+        Debug 1, "    " . $Fsb->size;
+    }
+
+#Who links to whom? First, make the choice:
+#If one of the inodes is a more sparse file, we link to that. In the end that gives more space savings.
+    if ($Fsb->blocks < $Ssb->blocks) {  #Link SecondFiles to the more sparse FirstInode
+        #Debug 3, "    First more sparse.";
+#The files are stripped from FilesOfInode by LinkFiles one by one as they're processed.
+#That's OK.
+        $PreferredInode = 1;
+    } elsif ($Fsb->blocks > $Ssb->blocks) {
+        #Debug 3, "    Second more sparse.";
+        $PreferredInode = 2;
+#Next, if one of the files is older (smaller modification time), link both to the older inode.
+    } elsif ($Fsb->mtime > $Ssb->mtime) {  #First file newer, link it to Second
+        #Debug 3, "    First newer.";
+        $PreferredInode = 2;
+    } elsif ($Ssb->mtime > $Fsb->mtime) {  #Second file newer, link it to First
+        #Debug 3, "    Second newer.";
+        $PreferredInode = 1;
+#Finally, if they use the same amount of space on disk and have the same mtime, see if one has more links than the other and glue both to the inode with more links.
+    } elsif ($Ssb->nlink > $Fsb->nlink) {  #Second inode has more hardlinks, link all firsts to it
+        #Debug 3, "    Second more hardlinks.";
+        $PreferredInode = 2;
+#(If they have the same number of links, or the first has more links, we'll hit this case and simply link any second files to the first inode by default.)
+    } else {
+        #Debug 3, "    First more hardlinks or equal.";
+        $PreferredInode = 1;
+    }
+
+
+#Second, actually perform the links (or at least record estimated savings, which needs to be done here at the inode level)
+    if ($PreferredInode == 1) {
+        #Get estimated space savings on a dry run
+        if ($ActuallyLink) {
+            foreach my $OneSecondFileindex (@SecondFileindexes) {  #Link all second inode fileindexes to the preferred first inode
+                LinkFiles $FirstFileindexes[0], $OneSecondFileindex;
+            }
+        } else {
+            #Debug 4, "FirEstUpdate: " . @SecondFileindexes . "," . $Ssb->nlink;
+            if (@SecondFileindexes == $Ssb->nlink) {
+                $EstimatedSpaceSaved += $Ssb->size;
+            } elsif (@SecondFileindexes > $Ssb->nlink) {
+                die @SecondFileindexes . " second filenames can't be larger than " . $Ssb->nlink . ", why is it?";
+            }  #no savings, nothing to do if (@SecondFileindexes < $Ssb->nlink)
+            foreach my $OneSecondFileindex (@SecondFileindexes) {  #Link all second inode fileindexes to the preferred first inode
+                #This is a mini version of LinkFiles for ActuallyLink=no
+                if ( ($FileNamesEqual) && ($FileOf[$FirstFileindexes[0]] ne $FileOf[$OneSecondFileindex]) ) {
+                    #Debug 3, "$Paths[$PathIndex[$FirstFileindexes[0]]]/$FileOf[$FirstFileindexes[0]] and $OneSecondFileindex have different filenames at the end, not linking.";
+                    return;
+                } else {
+                    Debug 1, "    Would have linked $Paths[$PathIndex[$FirstFileindexes[0]]]/$FileOf[$FirstFileindexes[0]] $Paths[$PathIndex[$OneSecondFileindex]]/$FileOf[$OneSecondFileindex]";
+                }
+            }
+        }
+    } elsif ($PreferredInode == 2) {
+        if ($ActuallyLink) {
+            foreach my $OneFirstFileindex (@FirstFileindexes) {  #Link all first inode fileindexes to the preferred second inode
+                LinkFiles $SecondFileindexes[0], $OneFirstFileindex;
+            }
+        } else {
+            #Debug 4, "SecEstUpdate: " . @FirstFileindexes . "," . $Fsb->nlink;
+            if (@FirstFileindexes == $Fsb->nlink) {
+                $EstimatedSpaceSaved += $Fsb->size;
+            } elsif (@FirstFileindexes > $Fsb->nlink) {
+                die @FirstFileindexes . " first fileindexes can't be larger than " . $Fsb->nlink . ", why is it?";
+            }  #no savings, nothing to do if (@FirstFileindexes < $Fsb->nlink)
+            foreach my $OneFirstFileindex (@FirstFileindexes) {  #Link all first inode fileindexes to the preferred second inode
+                #This is a mini version of LinkFiles for ActuallyLink=no
+                if ( ($FileNamesEqual) && ($FileOf[$SecondFileindexes[0]] ne $FileOf[$OneFirstFileindex]) ) {
+                    #Debug 3, $SecondFileindexes[0] . " and $OneFirstFileindex have different filenames at the end, not linking.";
+                    return;
+                } else {
+                    Debug 1, "    Would have linked $Paths[$PathIndex[$SecondFileindexes[0]]]/$FileOf[$SecondFileindexes[0]] $Paths[$PathIndex[$OneFirstFileindex]]/$FileOf[$OneFirstFileindex]";
+                }
+            }
+        }
+    } else {
+        die "Internal error, PreferredInode is $PreferredInode.";
+    }
+    return $PreferredInode;
+} #End sub LinkInodes
+
+
+sub CheckForLinkableInodes {
+    my $FirstInode = shift;
+    my $SecondInode = shift;
+    my $DidWeLink = 0;  #Return value for this function. 0 means no link was performed; 1 means the left-hand inode was kept
+        #and the right linked to it; 2 means the right was kept and the left linked to it (1 and 2 have to come from LinkInodes).
+
+    #FIXME - printing packed format
+    #Debug 2, "    Comparing $FirstInode to $SecondInode";
+
+    #Here we're using the file characteristics encoded in the InodeSpec to identify candidates for compare. If Paranoid is turned on, we'll re-verify
+    #all this just before linking. Turning Paranoid off risks problems with files being modified under us or a checksum cache with invalid entries.
+    my ($Fdev, $Fino, $Fmode, $Fuid, $Fgid, $Fsize, $Fmtime, $Fctime) = unpack("SLSSSLLL", $FirstInode);
+    #Debug 4, "$Fdev, $Fino, $Fmode, $Fuid, $Fgid, $Fsize, $Fmtime, $Fctime";
+
+    my ($Sdev, $Sino, $Smode, $Suid, $Sgid, $Ssize, $Smtime, $Sctime) = unpack("SLSSSLLL", $SecondInode);
+    #Debug 4, "$Sdev, $Sino, $Smode, $Suid, $Sgid, $Ssize, $Smtime, $Sctime";
+
+    if ($Fdev == $Sdev) {
+        #Same device
+        if ($Fino != $Sino) {
+            #Same device, different inodes. Can we link them?
+            if ( ($Fmode == $Smode) && ($Fuid == $Suid) && ($Fgid == $Sgid) && ($Fsize == $Ssize) ) {
+                #Same device, different inodes, same base characteristics. Check the modification time if the user wanted it.
+                #The following loosely translates to "Continue on with the link checks if the user didn't care or the files have the same time anyways."
+                if ( (!($DatesEqual)) || ($Fmtime == $Smtime) ) {
+                    #Same device, different inodes, same characteristics. Checksums match?
+                    #Note - we can't check for FileNamesEqual here.
+                    #We'll leave that until we actually have filenames to compare, and check that in LinkFiles.
+                    if (defined($FilesOfInode{$FirstInode}) && defined($FilesOfInode{$SecondInode})) {
+                        #@{$FilesOfInode{$FirstInode}}[0] is the first filename associated with $FirstInode
+                        #@{$FilesOfInode{$SecondInode}}[0] is the first filename associated with $SecondInode
+                        if ( MD5SumOf(@{$FilesOfInode{$FirstInode}}[0]) eq MD5SumOf(@{$FilesOfInode{$SecondInode}}[0]) ) {  #DO NOT use == for md5sums; the sum appears to overflow perl integers, or ignore chars perhaps
+                            #my $FirstSumDebug=MD5SumOf(@{$FilesOfInode{$FirstInode}}[0]);
+                            #my $SecondSumDebug=MD5SumOf(@{$FilesOfInode{$SecondInode}}[0]);
+                            #Debug 4, "Sum1: $FirstSumDebug, Sum2: $SecondSumDebug";
+                            #Debug 2, "    Identical, linking @{$FilesOfInode{$FirstInode}}[0] and @{$FilesOfInode{$SecondInode}}[0] and any other filenames.";
+                            $DidWeLink=LinkInodes $FirstInode, $SecondInode;
+                            if (($DidWeLink != 1) && ($DidWeLink != 2)) {
+                                die "Invalid return ($DidWeLink) from LinkInodes";
+                            }
+                        #} else {
+                        #    Debug 2, "    Checksums don't match.";
+                        }
+                    #} else {
+                    #    Debug 3, "    Ignoring stripped file.";
+                    }
+                #} else {
+                #    Debug 2, "    Not linking, different mtimes and user specified DatesEqual.";
+                }
+            #} else {
+            #    Debug 2, "    Can't hardlink, different attributes.";
+            }
+        }
+    #} else {
+    #    Debug 3, "    Different devices, no chance to link.";
+    }
+    return $DidWeLink;
+} #End sub CheckForLinkableInodes
+
+
+#Start Main()
+my $USAGEMSG = <<USAGE;
+Usage: freedups.pl [options]
+Options (default value in parentheses; 1=Enabled, 0=Disabled):
+  --actuallylink|-a           Actually link the files; otherwise, just report on potential savings and preload the md5sum cache. ($ActuallyLink)
+  --cachefile=<cache file>    File that holds cached queries and responses ($CacheFile) *
+  --datesequal|-d             Require that the modification dates and times be equal before linking ($DatesEqual)
+  --filenamesequal|-f         Require that the two (pathless) filenames be equal before linking ($FileNamesEqual)
+  --help|-h                   This help message
+  --maxfiles                  Maximum number of files to remember for a given inode; reduce to save memory ($MaxFiles)
+  --minsize|-m=<minimum size> Only consider files larger than this number of bytes ($MinSize)
+  --paranoid|-p               Recheck all file stats and completely compare every byte of the files just before linking. This should definitely be left on unless you are _positive_ that the md5 checksum cache is correct and there's no chance that files will be modified behind freedups' back. ($Paranoid)
+  --quiet|-q                  Show almost nothing; forces verbosity to 0.
+  --verbose|-v                Show more detail (Default verbosity=$Verbosity)
+* For security reasons, this file must be created before starting freedups or it will not be used at all.
+
+
+Examples:
+To report on what files could be linked under any kernel source trees and preload the md5sum cache, but not actually link them:
+  freedups /usr/src/linux-*
+To link identical files in those trees:
+  freedups -a /usr/src/linux-*
+To be more strict; the modification time and filename need to be equal before two files can be linked:
+  freedups -a --datesequal=yes -f /usr/doc /usr/share/doc
+To only link files with 1001 or more bytes:
+  freedups --actuallylink=yes -m 1000 /usr/src/linux-* /usr/src/pcmcia-*
+USAGE
+
+#Load command line params. Directories to be scanned are left in ARGV so we can pull them with shift in a moment.
+die "$USAGEMSG" unless GetOptions( 'actuallylink|a!'   => \$ActuallyLink,
+                                   'cachefile=s'       => \$CacheFile,
+                                   'datesequal|d!'     => \$DatesEqual,
+                                   'filenamesequal|f!' => \$FileNamesEqual,
+                                   'help|h'            => \$Help,
+                                   'maxfiles=i'        => \$MaxFiles,
+                                   'minsize|m=i'       => \$MinSize,
+                                   'paranoid|p!'       => \$Paranoid,
+                                   'quiet|q'           => sub { $Verbosity = 0 },
+                                   'verbose|v+'        => \$Verbosity );
+
+die "$USAGEMSG" if $Help;
+
+if ($MaxFiles <= 0) {
+    $MaxFiles=1;
+}
+
+#Start main code
+print "Freedups Version $FreedupsVer\n";
+print "Options Chosen: ";
+print "ActuallyLink " if $ActuallyLink;
+print "DatesEqual " if $DatesEqual;
+print "FileNamesEqual " if $FileNamesEqual;
+#If Help was set, we won't get this far
+print "Paranoid " if $Paranoid;
+print "None " if (!( ($ActuallyLink) || ($DatesEqual) || ($FileNamesEqual) || ($Paranoid) ));
+print "Verbosity=$Verbosity ";
+print "CacheFile=$CacheFile ";
+print "MaxFiles=$MaxFiles ";
+my $SmallestFileSize = $MinSize + 1;
+print "MinSize=$MinSize (only consider files $SmallestFileSize bytes and larger) ";
+undef $SmallestFileSize;
+print "\n";
+
+print "Starting to load md5 checksum cache from $CacheFile.\n";
+LoadSums $CacheFile;  #Wait until we've verified the filespecs before loading the cache, as this can take time.
+print "Finished loading checksums from checksum cache.\n";
+
+#Load dir specs from the command line
+while (my $OneSpec = shift) {
+    Debug 1, "Starting to scan $OneSpec";
+    #Check that it exists first
+    if (-e "$OneSpec") {
+        File::Find::find(\&wanted, $OneSpec);  #subroutine could also be written {wanted => \&wanted}
+        #This calls IndexFile(the_found_filename), which puts file info into the inode and file arrays and, as of 0.6.3, actually does _all_ the processing.
+        $NumSpecs++;
+    } else {
+        die "Could not find anything named $OneSpec, exiting.\n";
+    }
+}
+
+if ($NumSpecs == 0) {
+    die "$USAGEMSG\nNo directories or files specified, exiting.\n";
+}
+
+print "Finished processing inodes, appending new md5sums.\n";
+SaveSums $CacheFile;
+print "Finished saving md5sums.\n";
+
+print "$NumSpecs file specs searched.\n";
+print "$UniqueFilesScanned unique files scanned.\n";
+print "$UniqueInodesScanned unique inodes scanned.\n";
+print "$DroppedFilenames filenames were discarded because there were already $MaxFiles filenames for that inode.\n";
+print "Cached checksums: $CachedSums, From disk checksums: $FromDiskSums.\n";
+if ($ActuallyLink) {
+    print "Space saved: $SpaceSaved\n";
+} elsif ($EstimatedSpaceSaved == 0) {
+    print "No space would have been saved.\n";
+} else {
+    print "Up to $EstimatedSpaceSaved bytes would have been saved.\n";
+}
+#print "$SolitaryInodeSizes file sizes for which there was a single inode.\n";  #We no longer know this
+#print "$MultipleInodeSizes file sizes for which there was more than one inode.\n";  #nor this.
+if ($DiscardedSmallSums >= 1) {
+    print "Discarded $DiscardedSmallSums checksums of small files.\n";
+}
+
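For anyone who wants the core of freedups-v0.6.14.pl without reading all 800
lines, the following is a minimal, hypothetical sketch - not the shipped
freedups code, and the helper names (%candidates, %by_sum) are invented here
for illustration. It reuses the script's own equivalence-class trick,
pack("LSSS", size, uid, gid, mode), so only files agreeing on all four fields
are ever checksummed against each other, and like the freedups default it only
reports what it would link:

#!/usr/bin/perl
#Hypothetical condensation of the freedups equivalence-class idea.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %candidates;  #equivalence class key => list of full paths
File::Find::find(sub {
    return unless -f $_ && -s _ > 0;  #plain, non-empty files only
    my @st = lstat($_);               #dev,ino,mode,nlink,uid,gid,rdev,size,...
    #Key by size, uid, gid, mode - the same fields freedups packs.
    my $key = pack("LSSS", $st[7], $st[4], $st[5], $st[2]);
    push @{ $candidates{$key} }, $File::Find::name;
}, @ARGV);

foreach my $class (values %candidates) {
    next if @$class < 2;  #solitary files can't be linked to anything
    my %by_sum;           #md5 hexdigest => paths with that content
    foreach my $path (@$class) {
        open(my $fh, '<', $path) or next;
        binmode($fh);
        push @{ $by_sum{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
        close($fh);
    }
    foreach my $dups (values %by_sum) {
        print "would link: @$dups\n" if @$dups > 1;  #dry run, like freedups
    }
}

Run it as, say, "perl sketch.pl /usr/share/doc". The real script adds the
md5sum cache, the Paranoid byte-for-byte recheck, and the sparse/older/most-
linked preference when choosing which inode survives.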
commit b82e1384b77c63f2200e8ca839d16dd99ffdcf80
Author: Robert 'Bob' Jensen <bob@fedoraunity.org>
Date:   Sat Sep 18 15:19:19 2010 -0500
Moving rpms/freedup into its own folder and adding the freedups package in a folder of its own
diff --git a/rpms/README b/rpms/README
deleted file mode 100644
index b04d360..0000000
--- a/rpms/README
+++ /dev/null
@@ -1,9 +0,0 @@
-# Simple description:
-# I have uploaded the src.rpm here because it's still pending inclusion
-# into the repos. Hopefully it won't be needed too much longer.
-#
-# I've tested it against 12/13/14 and rawhide (currently F15).
-# Enjoy!
-#
-# Note: the debuginfo packages don't work currently, so if you install the
-# src.rpm, you're on your own.
diff --git a/rpms/freedup-1.5-7.fc13.src.rpm b/rpms/freedup-1.5-7.fc13.src.rpm
deleted file mode 100644
index 85bf6ac..0000000
Binary files a/rpms/freedup-1.5-7.fc13.src.rpm and /dev/null differ
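To verify what either package did to a pair of files, the README's ls -ali
procedure can be automated: two names are hardlinks of each other exactly when
they share both the device and the inode number that stat returns. The helper
below is illustrative only and is not part of either commit:

#!/usr/bin/perl
#Hypothetical helper: report whether two names are hardlinks of each other.
use strict;
use warnings;

die "usage: $0 file1 file2\n" unless @ARGV == 2;
my @a = stat($ARGV[0]) or die "cannot stat $ARGV[0]: $!\n";  #[0]=dev [1]=ino [3]=nlink
my @b = stat($ARGV[1]) or die "cannot stat $ARGV[1]: $!\n";
if ($a[0] == $b[0] && $a[1] == $b[1]) {
    print "$ARGV[0] and $ARGV[1] share inode $a[1] ($a[3] links)\n";
} else {
    print "$ARGV[0] and $ARGV[1] are separate files\n";
}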
reflector-commits@lists.stg.fedorahosted.org