Reputation: 5041
This is to be done in C++ or C. I know we can read the MP3s' metadata, but that information can be changed by anyone, can't it? So is there a way to analyze a file's contents, compare it against another file, and determine whether it is in fact the same song?
Edit: Lots of interesting things are coming out that I hadn't thought of. It looks like attempting this is not at all a good idea.
Upvotes: 2
Views: 3824
Reputation: 7163
I think the Fast Fourier Transform (FFT) approach hinted at by jstanley is pretty good for most use cases; in particular, it works for verifying that the two files are the same release / same recording by the same artist / same bitrate / same audio quality.
To be more explicit, sox and spek (via command line and GUI, respectively) can do this pretty painlessly.
Spek is pretty foolproof -- just open the software and point it to the two audio files in question.
sox can generate spectrograms (computed with FFTs) from the command line like so:
sox "$file" -n spectrogram -o "$outfile"
Either tool produces one image per file; if the two images look basically identical, then for almost all intents and purposes the two songs are equivalent.
For example, I wanted to test if these two files:
Soundtrack to an imaginary film mixtape 2011.mp3
DJRUM - Sountrack to an imaginary film mixtape 2011 (for mary-anne hobbs).mp3
were the same. diff reported a difference in the binary files (perhaps due to metadata differences or minor encoding differences), but a quick glance at their spectrograms resolved it.
Upvotes: 1
Reputation: 3051
Use the open source EchoPrint library to create a signature of the two audio files, and compare them with each other.
The library is very easy to use, and has clear examples on how to create the signatures.
You can even query their database with the signature and find matching song metadata (such as title, artist, etc).
Upvotes: 0
Reputation: 62157
No. Not SO simple.
You can check they contain the same encoded data, BUT:
In both cases the bytes would not match.
Basically, if a solution looks too simple to be true, it often is.
Upvotes: 2
Reputation: 5133
If you want to compare MP3 files that originated from the same MP3 but have been tagged with different metadata, it would be straightforward to just compare the actual audio data. Since they originated from the same MP3 encoding, you should be able to do a byte-by-byte comparison. You would not have to compare every byte; it should be sufficient to sample just a few to get a unique key that would be statistically almost impossible to find in another song.
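As a rough sketch of the byte-by-byte idea, the code below skips a leading ID3v2 tag and a trailing ID3v1 tag and compares whatever is left. The helper names (audio_range, same_audio_bytes, read_file) are made up for illustration, and the sketch assumes those two tag types are the only metadata present; other containers such as APEv2 or Lyrics3 tags are not handled. It compares every remaining byte; sampling only a few would be an optimization on top of this.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Returns the [begin, end) offsets left after skipping a leading ID3v2 tag
// and a trailing ID3v1 tag, if present. Hypothetical helper, not a full parser.
std::pair<std::size_t, std::size_t> audio_range(const Bytes& f) {
    std::size_t begin = 0;
    std::size_t end = f.size();
    // ID3v2 header: "ID3", 2 version bytes, 1 flag byte, 4-byte syncsafe size.
    if (f.size() >= 10 && f[0] == 'I' && f[1] == 'D' && f[2] == '3') {
        std::size_t size = (std::size_t(f[6] & 0x7F) << 21) |
                           (std::size_t(f[7] & 0x7F) << 14) |
                           (std::size_t(f[8] & 0x7F) << 7) |
                            std::size_t(f[9] & 0x7F);
        begin = 10 + size;
        if (f[5] & 0x10) begin += 10;   // optional 10-byte footer
        if (begin > end) begin = end;   // truncated file: nothing left
    }
    // ID3v1 trailer: the last 128 bytes start with "TAG".
    if (end - begin >= 128 && f[end - 128] == 'T' &&
        f[end - 127] == 'A' && f[end - 126] == 'G') {
        end -= 128;
    }
    return {begin, end};
}

// True if both files carry identical bytes once the tags are skipped.
bool same_audio_bytes(const Bytes& a, const Bytes& b) {
    const std::pair<std::size_t, std::size_t> ra = audio_range(a);
    const std::pair<std::size_t, std::size_t> rb = audio_range(b);
    if (ra.second - ra.first != rb.second - rb.first) return false;
    return std::equal(a.begin() + ra.first, a.begin() + ra.second,
                      b.begin() + rb.first);
}

// Reads a whole file into memory; returns an empty vector if it cannot be opened.
Bytes read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return Bytes(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>());
}
```

A call such as same_audio_bytes(read_file(pathA), read_file(pathB)) would then answer the "same encoding, different tags" case only; it says nothing useful about files produced by different encoders.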
If the files have been produced by different encoders, you would have to extract some "fuzzy" feature keys from the data and compare those keys. In a hurry I would probably construct an algorithm like this:
The benefit of a feature extraction approach is that you can build a database of features for all your MP3 files and, for a single file, ask the question: what other media files have exactly or almost exactly the same features as this one? The feature lookup could be implemented very efficiently with R*-trees or similar, which could be used to give you a fast distance measurement between the n-dimensional feature sets.
The above technique is essentially a variant of what is used in image search algorithms such as SIFT, which is probably the basis of such applications as Photosynth and Google Goggles. In image searching you filter the image for good candidate points for relatively unique features (such as corners of shapes), then you normalize the area around each feature to get normalized color, intensity, scale and direction. Finally you extract the features, search an n-dimensional database of features of other images, and verify that the features found in other images are geometrically positioned in the same pattern as in your search image. The technique for searching audio would be the same, only simpler, since audio is one-dimensional.
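To make the lookup concrete, here is a minimal sketch that uses a plain linear scan in place of an R*-tree; the tree would only make the same query faster. The Entry struct, the function names and the max_sq_dist cutoff are all hypothetical, and how the feature vectors themselves are extracted is left open.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// One database record: a file name plus its n-dimensional feature vector.
struct Entry {
    std::string name;
    std::vector<double> features;
};

// Squared Euclidean distance; vectors of different length never match.
double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
    if (a.size() != b.size()) return std::numeric_limits<double>::infinity();
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the entry closest to query, or -1 if nothing lies within
// max_sq_dist. Brute force: O(entries * dimensions) per query.
long nearest(const std::vector<Entry>& db, const std::vector<double>& query,
             double max_sq_dist) {
    long best = -1;
    double best_dist = max_sq_dist;
    for (std::size_t i = 0; i < db.size(); ++i) {
        const double d = squared_distance(db[i].features, query);
        if (d <= best_dist) {
            best_dist = d;
            best = static_cast<long>(i);
        }
    }
    return best;
}
```

The cutoff is what turns "nearest" into "almost exactly the same": a query with no entry inside the radius returns -1 rather than an unrelated song.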
Upvotes: 1
Reputation: 93556
If you mean "same song" in the iTunes sense of "same recording", it would be possible to compare two audio files, but not by a byte-by-byte comparison of the encoded files, since even for the same format there are variables such as data rate and compression that are selected at the time of encoding.
Also each encoding of the same recording may include different lead-in/lead-out timings, different amplitude and equalisation, and may have come from differing original sources (vinyl, CD, original master etc.). So you need a comparison method that takes all these variables into account, and even then you will end up with a 'likelihood' of a match rather than a definitive match.
If you genuinely mean "same song", i.e. any recording by any artist of the same composition and lyrics, then you are unlikely to get a high statistical correlation in most cases since pitch, tempo, range, instrumental arrangement will be very different.
In the "same recording" scenario, relatively simple signal processing and statistical techniques could be applied; in the "same song" scenario, AI techniques would need to be deployed, and even then I suspect the results would be poor.
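One example of a "relatively simple" technique for the same-recording case is a normalized cross-correlation searched over a range of time offsets: the offset search absorbs different lead-in timings and the normalization absorbs a plain gain difference. The sketch below assumes both files have already been decoded to mono samples at the same rate; the function names and the max_lag parameter are invented for illustration. It does nothing about equalisation or different source masters, and the result is a likelihood score, not a yes/no answer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between a[i] and b[i + lag] over their overlap.
// Returns 0 when the overlap is too short or either side is constant.
double correlation_at_lag(const std::vector<double>& a,
                          const std::vector<double>& b, long lag) {
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const long j = static_cast<long>(i) + lag;
        if (j < 0 || j >= static_cast<long>(b.size())) continue;
        const double x = a[i];
        const double y = b[static_cast<std::size_t>(j)];
        sa += x; sb += y; saa += x * x; sbb += y * y; sab += x * y;
        ++n;
    }
    if (n < 2) return 0.0;
    const double cov = sab - sa * sb / n;
    const double var_a = saa - sa * sa / n;
    const double var_b = sbb - sb * sb / n;
    if (var_a <= 0.0 || var_b <= 0.0) return 0.0;
    return cov / std::sqrt(var_a * var_b);
}

// Highest correlation found for any lag in [-max_lag, max_lag].
// Values near 1 suggest the same recording up to offset and gain.
double best_correlation(const std::vector<double>& a,
                        const std::vector<double>& b, long max_lag) {
    double best = -1.0;
    for (long lag = -max_lag; lag <= max_lag; ++lag) {
        best = std::max(best, correlation_at_lag(a, b, lag));
    }
    return best;
}
```

Keep max_lag small relative to the signal length, since a very short overlap can correlate well by chance. The cost is O(length * max_lag), so real use would probably work on downsampled audio or short excerpts.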
Upvotes: 1
Reputation: 814
If I really had to do this, my first attempt would be to take a Fourier transform of both songs and compare the histograms. You can use FFTW (http://www.fftw.org/) to take the Fourier transform, and then compare the histograms by summing the squares of the differences at each frequency. If the resultant sum is greater than some threshold (which you must determine by experimentation) then the songs are deemed to be different, otherwise they are the same.
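A minimal sketch of that comparison is below. To stay self-contained it uses a naive O(n^2) DFT instead of FFTW; in practice you would swap magnitude_spectrum for an FFTW-based version. It assumes both songs are already decoded to mono samples of equal length, and all function names and the threshold are placeholders to be tuned by experiment.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitude of each frequency bin 0..n/2, via a naive DFT (stand-in for an FFT).
std::vector<double> magnitude_spectrum(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<double> mag(n / 2 + 1, 0.0);
    for (std::size_t k = 0; k < mag.size() && n > 0; ++k) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = 2.0 * pi * k * t / n;
            re += x[t] * std::cos(angle);
            im -= x[t] * std::sin(angle);
        }
        mag[k] = std::sqrt(re * re + im * im) / n;
    }
    return mag;
}

// Sum of squared differences between two spectra, bin by bin.
double spectral_distance(const std::vector<double>& a,
                         const std::vector<double>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        const double d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

// Songs are deemed the same when the distance stays below the threshold.
bool probably_same(const std::vector<double>& x, const std::vector<double>& y,
                   double threshold) {
    return spectral_distance(magnitude_spectrum(x), magnitude_spectrum(y)) <
           threshold;
}
```

Because only magnitudes are compared, the measure ignores where in the song each frequency occurs, so two different songs with a similar overall spectrum could still land under the threshold.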
Upvotes: 5
Reputation: 1502935
It's possible, but very hard.
Even the same original recording may well be encoded differently by different MP3 encoders or the same encoder with different settings... leading to different results when the MP3 is then decoded. You'd need to work out an aural model to "understand" how big the differences are, and make a judgement.
Then there's the matter of different recordings. If I sing "Once in Royal David's City" and Aled Jones sings it, are those the same song? What if there are two different versions of a song where one has slightly modified lyrics? The key could be different, it could be in a different vocal range - all kinds of things.
How different can two songs be but still count as "the same song"? Once you've decided that, then there's the small matter of implementing it ;)
Upvotes: 13