Reputation: 45652
I want to write a program that deletes duplicate iTunes music files. One approach to identifying dupes is to compare MD5 digests of the MP3 and m4a files. Is there a more efficient strategy?
BTW the "Display Duplicates" menu command on iTunes shows false positives. Apparently it just compares on the Artist and Track title strings.
Upvotes: 4
Views: 2599
Reputation: 24759
If you use hashes to compare two sets of data, ideally they'd have to have exactly the same input each time in order to get exactly the same output (unless you miraculously picked two collisions of different input resulting in the same output). If you wanted to compare two MP3 files by hashing the entire file, then the two sets of song data might be exactly the same but since ID3 is stored inside the file, discrepancies there might make the files appear to be completely different. Since you're using a hash, you won't notice that perhaps 99% of the two files are matches because the outputs will be too different.
If you really want to use a hash to do this, you should only hash the sound data excluding any tags that may be attached to the file. This is not recommended, if music is ripped from CDs for example, and the same CD is ripped two different times, the results might be encoded/compressed differently depending on ripping parameters.
A better (but much more complicated) alternative would be an attempt to compare the uncompressed audio data values. With a little trial and error with known inputs can lead to a decent algo. Doing this perfectly will be very hard (if possible at all), but if you get something that's more than 50% accurate, it'll be better than going through by hand.
Note that even an algorithm that can detect if two songs are close (say the same song ripped under different parameters), the algo would have to be more complex than it's worth to tell if a live version is anything like a studio version. If you can do that, there's money to be made here!
And touching back on the original idea of how fast to tell if they're duplicates. A hash would be a lot faster, but a lot less accurate than any algorithm with this purpose. It's speed vs accuracy and complexity.
Upvotes: 3