How to remove duplicate files. Or: "how to clean up after your photo-collecting self after the Holidays"

Digital photography is great -- until the day you realize that you have over 5,428 photos and have stashed bits and pieces of your collection here or there (with multiple redundant copies). Bottom line: get fslint to find and eliminate duplicate files on your system.

The techniques and solutions I used are listed in my blog post. Comments and improvements please.
http://freephile.com/cms/content/remove-duplicate-files

### Full text reproduced below ###

Remove duplicate files
Mon, 2009/01/05 - 10:04pm -- greg

Or, how to clean up after your photo-collecting self.

Digital photography is great -- until the day you realize that you have over 5,428 photos and have stashed bits and pieces of your collection here or there (with multiple redundant copies). Bottom line: get fslint to find and eliminate duplicate files on your system.

Like most people, I've collected a large number of photos and other images over the years, all digital since about 2001, and stored in a variety of applications to make displaying them that much easier. I've stored them in folders, sometimes taking the time to organize them; sometimes there was not time, just a pile of picture files. I've written Perl, JavaScript and PHP scripts to organize and display my images. As I increased my skill at making my computer bend to my wishes, I found that other people had written better programs to organize photos, so I used Gallery. Finally, I came to realize that digiKam is the coolest thing since digital photography. digiKam can export to Gallery for sharing photos online, with the added benefit of completely organizing them on your local machine. digiKam can export to Flickr or Picasa if you don't want the privacy and control of your own photo website. digiKam does so much more. digiKam is what I use to manage all my photos now.

The problem is that I want to put all my images into digiKam. Although digiKam has a nice import function, it only detects file-name duplicates in the album you're importing into, rather than detecting duplicates across your entire image collection. So, as I am in the middle of re-organizing my photo collection, I have the tedious chore of comparing the locations where I originally stored images against various other locations, and I need to know whether I might have already imported them into digiKam.

Was that Christmas photo from 2002 saved in a folder labeled for the date the picture was taken, or for the date it was uploaded to the system? Did you accidentally experiment with a file-renaming scheme, or use the original names from the camera? Did you convert the image from jpg to png? If you delete this photo, are you deleting important minor modifications to the original, like red-eye reduction? These kinds of questions make re-organizing your photos far from trivial, and ultimately a manually intensive effort no matter how many excellent tools you have.
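Matching by content rather than by name sidesteps most of those questions for exact copies. Here is a minimal sketch of that idea (illustrative only, not one of the scripts I actually ran, and assuming the img/photos and img/library layout used below): index the library by checksum, then ask whether each candidate file already exists anywhere in it.

#!/bin/bash
# Illustrative sketch: find byte-for-byte duplicates by content checksum,
# regardless of file name or folder.  Assumes img/photos holds the candidates
# and img/library holds what has already been imported.
LIBRARY=img/library
SUSPECTS=img/photos
INDEX=/tmp/library.md5

# Index everything already in the library as "checksum  path" lines.
find "$LIBRARY" -type f -exec md5sum {} + > "$INDEX"

# For each candidate, look its checksum up in the index.
find "$SUSPECTS" -type f | while read -r file; do
  sum=$(md5sum "$file" | cut -d' ' -f1)
  if grep -q "^$sum " "$INDEX"; then
    echo "already in library: $file"
  else
    echo "not in library yet: $file"
  fi
done

A renamed or relocated copy will match this way, but an edited or jpg-to-png-converted copy will not, which is why the visual and metadata checks described below still matter.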
After some really good finger exercise using rsync:

# -n dry run, -v verbose, -i itemize changes, -a archive, -r recursive, -z compress
# --ignore-existing : quickly see which files are completely outside your target
# --checksum        : do a thorough analysis of which files might have changed
# --exclude         : add excludes as needed to ignore file system cruft, thumbnails, etc.
# --compare-dest    : helps when files may be organized into more than one target
#                     at the boundary of a year
rsync -nviarz --ignore-existing --progress --stats --checksum \
    --exclude .directory \
    --compare-dest ../2002/ \
    img/photos/2003_01_21/ img/library/albums/family/2003/

and bash:

#!/bin/bash
# Look for files from the suspect folder anywhere under the digiKam library.
SUSPECTS=/home/greg/img/photos/
FILTER='IMG*'
for file in $(find $SUSPECTS -maxdepth 1 -type f -name "$FILTER" | sed s:$SUSPECTS::); do
  find img/library -name "$file"
done

#!/bin/bash
# Report which JPGs in a source folder are missing from the target album,
# treating a .png conversion of the same name as already present.
#A=img/photos/2006-10/
#A=img/photos/2006-08/
A=img/photos/
B=img/library/albums/family/2006/
BASE=$(pwd)
SOURCEDIR=$BASE/$A
TARGETDIR=$BASE/$B
pushd $A
for FILE in *JPG; do
  # set a couple of flags we can use to determine if the file has been processed
  FOUNDA=false
  FOUNDB=false
  # echo "checking for $FILE in $TARGETDIR"
  if [ -f "$TARGETDIR$FILE" ]; then
    echo 'the original file exists in target'
    FOUNDA=true
  fi
  COUSIN=$(echo "$FILE" | sed s/JPG/png/)
  # echo "considering $COUSIN too"
  if [ -f "$TARGETDIR$COUSIN" ]; then
    echo 'the original has already been converted to png'
    FOUNDB=true
  fi
  if [[ $FOUNDA == false && $FOUNDB == false ]]; then
    echo "$FILE not found: please copy"
  fi
done
popd

I started looking for scripts that other people might have used to find duplicate files, and that's when I found fslint. I should have checked earlier. Still, the need to do visual comparisons with tools such as GQview, and to check and correct EXIF metadata in digiKam, means this is not a task that can be solved with the push of a button; fslint is just a big time saver over writing and tweaking your own scripts.

Note: digiKam has a built-in tool to find duplicate images in its database, but it is resource-intensive because it is content-based (building and comparing digital fingerprints of your files). In testing, it produced false positives. Even if it worked flawlessly, it would be wasted effort to import thousands of potential duplicates just to find and eliminate them.

--
Greg Rundlett
Web Developer - Initiative in Innovative Computing
http://iic.harvard.edu
m. 978-764-4424
o. 978-225-8302
skype/aim/irc/twitter freephile
http://profiles.aim.com/freephile
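P.S. Whatever flags a pair of suspected duplicates (fslint, digiKam's fingerprint search, or scripts like the ones above), it costs almost nothing to confirm the match byte-for-byte before deleting anything. The paths here are only placeholders:

cmp -s img/photos/2002-12-25/IMG_0042.JPG img/library/albums/family/2002/IMG_0042.JPG \
  && echo "identical: safe to remove one copy" \
  || echo "different: keep both and look closer"

cmp exits 0 only when the two files are exactly identical, so a fingerprint-based match can be double-checked without trusting it blindly.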