Manuel Arwed Schmidt

Reputation: 3596

Find duplicate content among millions of user-edited audio files (audio content hashing)

I have a problem that involves processing more than a million audio files (from user-generated video content) that may have been edited (mostly cut) and uploaded in various qualities. My task is to map all duplicates to one single item ID so we can later filter and show only the full-length, best-quality videos.

Since the visual content may barely differ between distinct files, we would like to use the audio tracks for this purpose. That is why I'm looking for an audio content hash that is reasonably resistant to the edits described above. You might call it the 'Shazam' problem.

My question is: what do you think is the easiest way to find these potential duplicates (manual approval is possible)?

A subquestion would be: how would you deal with chunks that are shifted out of alignment, i.e. how do you make sure that the hash input taken from two audio files of different lengths is always the same?

My current approach would be to scan through the audio and, at each local peak of the waveform within a given time window, generate some kind of hash over the following 20-30 second chunk. I can easily store a few dozen hashes per file, as long as the duplicate lookup is a key-value lookup and not an intersection with all the other hashes.
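
Roughly like this (a minimal Python sketch, assuming 16-bit mono PCM decoded into a NumPy array; the window and chunk lengths are illustrative, and hashing quantized raw bytes is only a placeholder, since surviving re-encoding would really need perceptual features):

    import hashlib
    import numpy as np

    def peak_anchored_hashes(samples, sample_rate, window_s=5.0, chunk_s=20.0):
        """Hash the chunk that follows each local energy peak.

        Anchoring chunks on peaks rather than on fixed offsets means a
        cut at the start of the file shifts every chunk by the same
        amount but leaves the chunk contents themselves intact.
        """
        window = int(window_s * sample_rate)
        chunk = int(chunk_s * sample_rate)
        energy = samples.astype(np.float64) ** 2
        hashes = []
        for start in range(0, len(samples) - chunk, window):
            # Local maximum of signal energy within this window.
            peak = start + int(np.argmax(energy[start:start + window]))
            block = samples[peak:peak + chunk]
            if len(block) == chunk:
                # Coarse quantization so tiny amplitude differences do
                # not change the digest (a placeholder, see above).
                quantized = (block >> 8).astype(np.int8).tobytes()
                hashes.append(hashlib.sha1(quantized).hexdigest())
        return hashes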

I have no metadata or anything else that could be used.

Upvotes: 1

Views: 175

Answers (1)

DrKoch

Reputation: 9782

There is a very good description of how Shazam works internally:

An Industrial-Strength Audio Search Algorithm

They search for the most prominent frequency components and their relative distances, and store these distances in a clever way that allows for fast search and matching.
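
In essence, something like the following (a rough Python sketch with illustrative parameters; the paper keeps several spectral peaks per frame and packs the hashes more carefully):

    import numpy as np
    from scipy.signal import spectrogram

    def landmark_hashes(samples, sample_rate, fan_out=5):
        freqs, times, spec = spectrogram(samples, fs=sample_rate, nperseg=1024)
        # Simplification: keep only the single most prominent
        # frequency bin per time frame as a "landmark".
        peak_bins = np.argmax(spec, axis=0)
        hashes = []
        for i in range(len(peak_bins)):
            # Pair each landmark with the next few landmarks.
            for j in range(i + 1, min(i + 1 + fan_out, len(peak_bins))):
                f1, f2 = int(peak_bins[i]), int(peak_bins[j])
                dt = j - i  # frame distance between the two peaks
                # Pack (f1, f2, dt) into one integer key; the anchor
                # time is kept as the value for offset voting later.
                key = (f1 << 20) | (f2 << 8) | dt
                hashes.append((key, float(times[i])))
        return hashes

Matching then reduces to exactly the key-value lookup you describe: two files sharing many keys whose anchor times differ by one consistent offset are very likely duplicates, and that offset voting also answers your alignment subquestion.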

This may look very complicated, but robust fingerprinting of audio files takes some effort; this is not a trivial problem at all.

Upvotes: 4
