My wife and I are (or at least were) shutterbugs of a sort. At this moment, I have just shy of 1.5 terabytes of photos that she and I have taken over the years. I’ve also managed to make a hash of them with copies, duplicates and the occasional “I think I have this somewhere, but I can’t say for certain” directory.
I’ve been looking for the past year or two for a solution and still haven’t really found one I liked, so like a good nerd, I’ve rolled my own. It’s cobbled together using BASH, ImageMagick, dcraw and MediaInfo.
My primary goal was to make sure that I had one copy of every file, not necessarily one high quality version of each picture or video. That means that if I end up with duplicates of a picture in RAW, high-res JPEG and a JPEG thumbnail, I’m OK with that. Once I have finished the initial culling of the photos, I may take another swipe at deduping them further.
Anyway, my script starts by recursively looping through the current path and all subdirectories. If it encounters a file, it will retrieve the extension and then conditionally call some combination of the above utilities to retrieve the creation/capture/modified timestamp and the camera make/model. It does this fairly well, but there are some major caveats which I’ll discuss in a bit.
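As a rough sketch of that dispatch step (not the script’s actual code), the extension can be pulled out with shell parameter expansion and mapped to a handler with a case statement. The helper names extension_of and handler_for, and the "skip" placeholder, are assumptions made for illustration; the handler names mirror the functions in the script further down.

```bash
#!/bin/bash

# Print the upper-cased extension of a path, or an empty line if there is none.
extension_of() {
    local name="${1##*/}"          # strip leading directories
    case "$name" in
        *.*) printf '%s\n' "${name##*.}" | tr '[:lower:]' '[:upper:]' ;;
        *)   printf '\n' ;;
    esac
}

# Map an extension to the name of the function that would handle it.
# "skip" is a hypothetical placeholder for anything unmapped.
handler_for() {
    case "$1" in
        NEF|NRW) echo processRAW ;;
        JPG)     echo processJPG ;;
        AVI)     echo processAVI ;;
        MOV)     echo processMOV ;;
        MP4|M4V) echo processMPEG4 ;;
        *)       echo skip ;;
    esac
}

handler_for "$(extension_of "/photos/2014/DSC_0001.nef")"   # prints processRAW
```

Using tr rather than newer case-conversion expansions keeps this working on the old Bash that ships with macOS.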
Once it has retrieved the above, it starts creating the following folder structure:
/<camera>/<year>/<month>/<day>
It then takes the file and tries to copy it into the following:
/<camera>/<year>/<month>/<day>/<year><month><day><hour><minute><second>.<#>.<ext>
This should, in theory, allow me to identify a specific picture taken by a specific type of camera at a specific moment in time. The initial issue I ran into with this is around time resolution. The timestamps given to me by the various tools only resolve to the second (not millisecond like I’d prefer). This means that if you have a camera that can take multiple pictures per second, then you can easily end up with duplicates, hence the <#> at the end.
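To make the naming scheme concrete, here is a minimal sketch that builds the target path from a camera name and an EXIF-style timestamp. The function name target_path, the /tmp/Sorted root and the sample camera name are assumptions for illustration only. It also splits the timestamp with a single read instead of one date call per field, which is a different technique from the one the script uses.

```bash
#!/bin/bash

# Build the destination path for one file.
#   $1 root of the sorted tree (hypothetical, e.g. /tmp/Sorted)
#   $2 camera name
#   $3 timestamp in EXIF form "YYYY:MM:DD HH:MM:SS"
#   $4 collision suffix (0, 1, 2, ...)
#   $5 extension
target_path() {
    local year month day hour minute second
    # Split on both ':' and ' ' in a single read, without spawning date.
    IFS=': ' read -r year month day hour minute second <<< "$3"
    printf '%s/%s/%s/%s/%s/%s%s%s%s%s%s.%s.%s\n' \
        "$1" "$2" "$year" "$month" "$day" \
        "$year" "$month" "$day" "$hour" "$minute" "$second" "$4" "$5"
}

target_path /tmp/Sorted "NIKON D90" "2014:07:05 11:12:16" 0 NEF
# prints /tmp/Sorted/NIKON D90/2014/07/05/20140705111216.0.NEF
```

Two frames shot within the same second would differ only in the suffix, for example .0.NEF and .1.NEF.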
If I encounter a file with the same timestamp, I then do an MD5 sum on both files to confirm they are actually the same file. If they are, then off to a duplicates tree the file goes. If they aren’t the same, then I start an auto-increment pass until I can write the file out uniquely in the target folder.
One issue, though, is that if the 0 file doesn’t match, I don’t check for subsequent matches, so the script could easily end up with files 1, 2, 3 and 4 all being duplicates of each other. Maybe I’ll try and fix that in a future edit.
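One possible way to close that gap, sketched under assumptions rather than taken from the script: walk every existing suffix and compare the incoming file against each one before settling on a free slot. The function name find_slot is hypothetical, and it uses cmp -s for a byte-by-byte comparison in place of the MD5 sums described above, since cmp is available on both macOS and Linux.

```bash
#!/bin/bash

# Decide where a file belongs among STEM.0.EXT, STEM.1.EXT, ...
# Prints "duplicate N" if the source is byte-identical to slot N,
# otherwise "free N" for the first unused slot.
find_slot() {
    local src="$1" stem="$2" ext="$3" n=0
    while [ -e "$stem.$n.$ext" ]; do
        if cmp -s "$src" "$stem.$n.$ext"; then
            echo "duplicate $n"
            return 0
        fi
        n=$((n + 1))
    done
    echo "free $n"
}

# Small demonstration in a scratch directory.
demo=$(mktemp -d)
printf 'first shot'  > "$demo/20140705111216.0.JPG"
printf 'second shot' > "$demo/20140705111216.1.JPG"
printf 'second shot' > "$demo/incoming.JPG"
find_slot "$demo/incoming.JPG" "$demo/20140705111216" JPG   # prints duplicate 1
rm -r "$demo"
```

A caller could send the file to the duplicates tree on "duplicate" and move it to the reported slot on "free".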
As for the tooling, I use the following:
- BASH
- ImageMagick’s “identify -verbose” command to get information on JPEGs
- MediaInfo for details on MP4/M4V/AVI/MOV files
- dcraw for details on RAW files (such as Nikon’s NEF/NRW)
Probably one of the biggest issues I have is that while your typical JPG/NRW/NEF file includes the camera details, a video typically does not. That means that I’m a bit hard pressed to determine what camera took a specific video. I also found that the camera metadata for when the file was captured isn’t always that useful, so there are limits.
One other thing to note: ImageMagick isn’t always that fast, so there’s room for improvement here, specifically around JPEG processing. I was hoping to use macOS’s mdls for getting the camera data, but that only works if the filesystem is local (not mounted like mine was).
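One cheap improvement that might help, offered as a sketch and not measured: the script runs identify two or three times per JPEG, so capturing its output once and pulling each field from the saved text should cut that work down. The names exif_value and read_jpeg_meta are hypothetical, and the sample text below is hand-written to resemble identify’s “key: value” lines.

```bash
#!/bin/bash

# Pull one "key: value" entry out of a captured `identify -verbose` dump.
#   $1 full dump text, $2 key such as exif:Model
exif_value() {
    printf '%s\n' "$1" | sed -n "s/^ *$2: //p" | head -n 1
}

# Run identify once per file and reuse the dump for every field.
# Defined here but not called, so the demo below runs without ImageMagick.
read_jpeg_meta() {
    local dump
    dump=$(identify -verbose "$1") || return 1
    TIMESTAMP=$(exif_value "$dump" "exif:DateTimeDigitized")
    CAMERA=$(exif_value "$dump" "exif:Model")
}

# Parsing demonstrated on a hand-written sample.
sample='  Properties:
    exif:DateTimeDigitized: 2014:07:05 11:12:16
    exif:Model: NIKON D90'
exif_value "$sample" "exif:Model"   # prints NIKON D90
```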
If you’re a BASH expert, please be kind. I’m good at programming, not always good at scripting. Otherwise, help yourself.
#!/bin/bash
BASE="/Volumes/e/Pictures/Processed"
moveFile()
{
# local SOURCE="$1"
# local BASE="$2"
# local EXT="$3"
# local SUFFIX="$4"
if [ -f "$2.$4.$3" ] ; then
moveFile "$1" "$2" "$3" $(($4 + 1 ))
else
mv -n "$1" "$2.$4.$3"
fi
}
moveNonDuplicateFile()
{
mkdir -p "$BASE/Sorted/$2/$3/$4/$5"
local T="$BASE/Sorted/$2/$3/$4/$5/$3$4$5$6$7$8"
moveFile "$1" "$T" "$9" "0"
}
# moveDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
moveDuplicateFile()
{
mkdir -p "$BASE/Duplicates/$2/$3/$4/$5"
local T="$BASE/Duplicates/$2/$3/$4/$5/$3$4$5$6$7$8"
# Use the extension passed in as $9 rather than relying on the caller's EXT variable
moveFile "$1" "$T" "$9" 0
}
processMOV()
{
TIMESTAMP=`mediainfo "$1" | grep "Encoded date" | head -n 1 | sed 's/Encoded date//' | awk '{$1=$1;print}' | sed 's/: //'`
YEAR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%Y`
MONTH=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%m`
DAY=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%d`
HOUR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%H`
MINUTE=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%M`
SECOND=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%S`
if [ -z "$CAMERA" ] ; then
CAMERA="MOV"
fi
}
processRAW()
{
TIMESTAMP=`dcraw -i -v "$1" | grep Timestamp | sed s/Timestamp\:\ //`
YEAR=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%Y`
MONTH=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%m`
DAY=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%d`
HOUR=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%H`
MINUTE=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%M`
SECOND=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%S`
CAMERA=`dcraw -i -v "$1" | grep 'Camera:' | awk -F\: '{ print $2 }' | tr '[:lower:]' '[:upper:]' | awk '{$1=$1;print}'`
}
processAVI()
{
TIMESTAMP=`mdls "$1" | grep kMDItemContentCreationDate | sed 's/kMDItemContentCreationDate = //'`
if [ "$TIMESTAMP" == "" ] ; then
YEAR="0000"
MONTH="00"
DAY="00"
HOUR="00"
MINUTE="00"
SECOND="00"
else
YEAR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
MONTH=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%m`
DAY=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%d`
HOUR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%H`
MINUTE=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%M`
SECOND=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%S`
fi
if [ -z "$CAMERA" ] ; then
CAMERA="AVI"
fi
}
processJPG()
{
# TIMESTAMP=`mdls "$1" | grep kMDItemContentCreationDate | sed 's/kMDItemContentCreationDate = //'`
# CAMERA=`mdls "$1" | grep kMDItemAcquisitionModel | sed 's/kMDItemAcquisitionModel = \"//' | sed s/\"//`
TIMESTAMP=`identify -verbose "$1" | grep DateTimeDigitized | sed 's/ exif:DateTimeDigitized: //'`
TIMESTAMP="$TIMESTAMP -0000"
CAMERA=`identify -verbose "$1" | grep "exif:Model" | sed 's/ exif:Model: //'`
if [ "$TIMESTAMP" == " -0000" ] ; then
TIMESTAMP=`identify -verbose "$1" | grep "date:modify" | sed 's/ date:modify: //' | sed 's/\(.*\)-\(.*\)-\(.*\)T\(.*\)\([+-]\)\(.*\):\(.*\)/\1:\2:\3 \4 \5\6\7/'`
#TIMESTAMP=`mdls "$1" | grep kMDItemFSContentChangeDate | sed 's/kMDItemFSContentChangeDate = //'`
fi
# | awk '{$1=$1;print}'`
#echo $TIMESTAMP / $IMAGE
# 2014-07-05T11:12:16-04:00
if [ "$TIMESTAMP" == "" ] ; then
YEAR="0000"
MONTH="00"
DAY="00"
HOUR="00"
MINUTE="00"
SECOND="00"
else
# YEAR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
# MONTH=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%m`
# DAY=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%d`
#
# HOUR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%H`
# MINUTE=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%M`
# SECOND=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%S`
YEAR=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
MONTH=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%m`
DAY=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%d`
HOUR=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%H`
MINUTE=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%M`
SECOND=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%S`
fi
if [ -z "$CAMERA" ] ; then
CAMERA="Unidentified"
fi
}
processMPEG4()
{
TIMESTAMP=`mediainfo "$1" | grep "Encoded date" | head -n 1 | sed 's/Encoded date//' | awk '{$1=$1;print}' | sed 's/: //'`
YEAR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%Y`
MONTH=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%m`
DAY=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%d`
HOUR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%H`
MINUTE=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%M`
SECOND=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%S`
if [ -z "$CAMERA" ] ; then
CAMERA="MP4"
fi
}
processFile()
{
echo "Processing file $1"
local EXT=`echo "$1" | sed 's/.*\.\([A-Za-z0-9]*\)/\1/' | tr '[:lower:]' '[:upper:]'`
case $EXT in
# Picture Formats Here
NEF)
processRAW "$1"
;;
NRW)
processRAW "$1"
;;
JPG)
processJPG "$1"
;;
# Media Formats Here
AVI)
processAVI "$1" "$EXT"
;;
MOV)
processMOV "$1" "$EXT"
;;
MP4)
processMPEG4 "$1" "$EXT"
;;
M4V)
processMPEG4 "$1" "$EXT"
;;
DB)
rm "$1"
return 0
;;
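PNG)
# Set PNGs aside under Other/PNG; move the file rather than deleting it first
mkdir -p "$BASE/Other/PNG"
moveFile "$1" "$BASE/Other/PNG/${2%.*}" "PNG" 0
return 0
;;
PANO)
# Same treatment for PANO files
mkdir -p "$BASE/Other/PANO"
moveFile "$1" "$BASE/Other/PANO/${2%.*}" "PANO" 0
return 0
;;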
\*)
# Literal "*" left by the glob in an empty directory: nothing to do
return 0
;;
DS_STORE)
rm "$1"
return 0
;;
*)
echo "Unmapped extension $EXT"
return 0
;;
esac
# Skip the file if any timestamp component is missing.
# return (not continue) is the right way to leave a function.
if [ -z "$YEAR" ] ; then
echo "Image with no YEAR"
return 1
fi
if [ -z "$MONTH" ] ; then
echo "Image with no MONTH"
return 1
fi
if [ -z "$DAY" ] ; then
echo "Image with no DAY"
return 1
fi
if [ -z "$HOUR" ] ; then
echo "Image with no HOUR"
return 1
fi
if [ -z "$MINUTE" ] ; then
echo "Image with no MINUTE"
return 1
fi
if [ -z "$SECOND" ] ; then
echo "Image with no SECOND"
return 1
fi
TARGET="$BASE/Sorted/$CAMERA/$YEAR/$MONTH/$DAY/$YEAR$MONTH$DAY$HOUR$MINUTE$SECOND.0.$EXT"
if [ -f "$TARGET" ] ; then
SOURCEHASH=`md5 -r "$1" | awk '{ print $1; }'`
TARGETHASH=`md5 -r "$BASE/Sorted/$CAMERA/$YEAR/$MONTH/$DAY/$YEAR$MONTH$DAY$HOUR$MINUTE$SECOND.0.$EXT" | awk '{ print $1; }'`
if [ "$SOURCEHASH" == "$TARGETHASH" ] ; then
moveDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
else
moveNonDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
fi
else
moveNonDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
fi
# mv -n "$1" "$TARGET"
}
processDirectory()
{
echo "Processing dir $1"
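# Bail out if the directory cannot be entered, so nothing is processed in the wrong place
cd "$1" || return 1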
YEAR=""
MONTH=""
DAY=""
HOUR=""
MINUTE=""
SECOND=""
CAMERA=""
for FILE in * ; do
if [ -d "$1/$FILE" ] ; then
processDirectory "$1/$FILE"
# rmdir "$1/$FILE"
else
processFile "$1/$FILE" "$FILE"
fi
done
if [ -f ".DS_Store" ] ; then
rm .DS_Store
fi
cd ..
rmdir "$1"
}
CURRENT=`pwd`
processDirectory "$CURRENT"