Reputation: 389
How can I create a checksum of only the media data, without the metadata, to get a stable identifier for a media file? Preferably a cross-platform approach using a library that supports many formats, e.g. VLC, ffmpeg or MPlayer.
(The media files would be audio and video in common formats; images would be nice to have too.)
Upvotes: 8
Views: 2953
Reputation: 15009
Well, it may be 11 years too late for an answer, but in case others like me stumble upon this...
ffmpeg can output checksums for individual streams. So the same audio or video would output the same checksum independently of its container format or metadata.
Example for the video track of file $filename, writing the output to $filename.md5:
ffmpeg -i "$filename" -map 0:v -codec copy -f md5 "$filename.md5"
For audio, use -map 0:a.
To output to STDOUT, use - as the output file name. For example:
ffmpeg -i "$filename" -map 0:a -codec copy -hide_banner -loglevel warning -f md5 -
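To check that two files really carry the same stream, the command above can be wrapped in a pair of small helpers. This is only a sketch: the function names stream_md5 and same_stream are made up for illustration, and it assumes ffmpeg is on the PATH. Because of -codec copy, the hash is taken over the encoded packets, so it only matches when the stream was copied, not re-encoded.

```shell
#!/bin/bash
# Hypothetical helpers around "ffmpeg -f md5"; the names are illustrative only.

# stream_md5 FILE SPEC: print the MD5 of the streams selected by SPEC
# (for example 0:a or 0:v), without the "MD5=" prefix.
stream_md5() {
    local out
    out=$(ffmpeg -nostdin -hide_banner -loglevel error \
            -i "$1" -map "$2" -codec copy -f md5 -) || return 1
    printf '%s\n' "${out#MD5=}"
}

# same_stream FILE_A FILE_B SPEC: succeed if both files hash identically.
# Returns 2 if ffmpeg fails on either file, 1 if the hashes differ.
same_stream() {
    local a b
    a=$(stream_md5 "$1" "$3") || return 2
    b=$(stream_md5 "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_stream original.mp3 retagged.mp3 0:a && echo "same audio"` should print the message when only the tags differ between the two files.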
Upvotes: 8
Reputation: 26
Here is a shell script around mvik's ffmpeg-based answer which prints the MD5 in case of success, or the stderr output in case of failure.
#!/bin/bash
# Compute the MD5 of the audio stream of an MP3 file, ignoring ID3 tags.
# The problem with comparing MP3 files is that a simple change to the ID3 tags
# in one file will cause the two files to have differing MD5 sums. This script
# avoids that problem by taking the MD5 of only the audio stream, ignoring the
# tags.
# Note that by virtue of using ffmpeg, this script happens to also work for any
# other audio file format supported by ffmpeg (not just MP3s).
set -e
# Temporary files for ffmpeg's stdout (the hash) and stderr (diagnostics).
stdoutf=$(mktemp mp3md5.XXXXXX)
stderrf=$(mktemp mp3md5.XXXXXX)
set +e
# -map 0:a selects only the audio streams, so that other streams
# (for example embedded cover art) are not included in the hash.
ffmpeg -i "$1" -map 0:a -c:a copy -f md5 - >"$stdoutf" 2>"$stderrf"
ret=$?
set -e
if test "$ret" -ne 0; then
    # Failure: show what ffmpeg complained about.
    cat "$stderrf"
else
    # Success: print the hash without the "MD5=" prefix.
    sed 's/MD5=//' "$stdoutf"
fi
rm -f "$stdoutf" "$stderrf"
exit "$ret"
Upvotes: 0
Reputation: 10534
I don't know of any existing platform-independent software that will accomplish this, but I do know a way that this could be accomplished in an interpreted (platform-independent) language such as Java.
Essentially, we need to strip any metadata (tags) from the file, demultiplexing video files first. In theory, after demuxing and removing the metadata, one could hash the file and compare the result against another file that has undergone the same process, matching identical files despite their different tags. Unlike a fingerprint, this would identify only identical files, not similar songs or movies (imagine you want to keep the 10 different versions or bitrates of a given song you've archived, but don't want 2 identical copies of any of them floating around).
The most troubling part is removing the tags: there are many different tag format specifications, and applications do not necessarily implement them the same way, i.e. the same audio file given identical tags by two different applications may not result in identical output files. This would be fatal to the concept of an audio-only checksum only if popular tagging software changes the binary audio portion of the file, or pads the audio in a non-standard way.
Taking a checksum is trivial, but off the top of my head I'm not aware of any platform-independent libraries to demux and detag MPEG files. I know that in *nix environments mpgtx is a great command-line tool that could perform the demux and detag, but obviously that is not a platform-independent solution.
Maybe someone out there feels ambitious?
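For the simplest MP3 case, the strip-then-hash idea can be sketched with standard shell tools. This is an illustration under stated assumptions, not a general detagger: it only skips an ID3v2 tag at the start of the file and an ID3v1 tag at the end, ignores other tag types such as APE or Lyrics3, does no demuxing, and assumes sha256sum from GNU coreutils. The function name audio_only_sha256 is made up.

```shell
#!/bin/bash
# Sketch only: hash an MP3's bytes with a leading ID3v2 tag and a trailing
# ID3v1 tag skipped. audio_only_sha256 is a hypothetical name.

audio_only_sha256() {
    local file=$1 size start=0 end flags b1 b2 b3 b4
    size=$(wc -c <"$file") || return 1
    size=$((size))   # normalise: some wc implementations pad with spaces
    end=$size

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then 4 "synchsafe"
    # size bytes (7 bits each). The size excludes the 10-byte header itself.
    if [ "$(head -c 3 "$file")" = "ID3" ]; then
        read -r flags b1 b2 b3 b4 < <(od -An -tu1 -j5 -N5 "$file")
        start=$((10 + (b1 << 21 | b2 << 14 | b3 << 7 | b4)))
        # Flag bit 0x10 marks an additional 10-byte footer (ID3v2.4).
        if ((flags & 0x10)); then start=$((start + 10)); fi
    fi

    # ID3v1 tag: the last 128 bytes of the file, beginning with "TAG".
    if ((end - start >= 128)) &&
       [ "$(tail -c 128 "$file" | head -c 3)" = "TAG" ]; then
        end=$((end - 128))
    fi

    # Hash only the bytes between the two tags.
    tail -c "+$((start + 1))" "$file" | head -c "$((end - start))" |
        sha256sum | cut -d ' ' -f 1
}
```

Calling `audio_only_sha256 song.mp3` on two copies that differ only in these two tag types should print the same hash. As noted above, a tagger that pads or rewrites the audio portion would defeat this.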
Upvotes: 3
Reputation: 389
One possible solution I found seems to be with VLC:
./VLC -I rc snd.mp3 :sout='#std{mux=raw,access=file,dst=-}' vlc://quit | sha1sum
Upvotes: 0