Reputation: 145
About the data: we have two video files with the same content and the same audio, but they differ in audio quality: one is 128 kbps and the other is 320 kbps.
We used ffmpeg to extract the audio from each video and generated a hash value for each audio file. For the 320 kbps file, the command and its output were:
ffmpeg -loglevel error -i 320kbps.wav -map 0 -f hash -
SHA256=4c77a4a73f9fa99ee219f0019e99a367c4ab72242623f10d1dc35d12f3be726c
We did the same for the other audio file, the one we want to compare against:
C:\FFMPEG>ffmpeg -loglevel error -i 128kbps.wav -map 0 -f hash -
SHA256=f8ca7622da40473d375765e1d4337bdf035441bbd01187b69e4d059514b2d69a
We know that these audio files and their hash values are different, but we want to know how different or similar they actually are, for example as a distance between a and b of, say, 3.
Can someone help with this?
Upvotes: 4
Views: 2750
Reputation: 4510
Cryptographic hashes like SHA-256 cannot be used to compare the distance between two audio files. Cryptographic hashes are deliberately designed to be unpredictable and to ideally reveal no information about the input that was hashed.
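To make that concrete, here is a minimal sketch using Python's standard `hashlib`. The two inputs are made-up byte strings (not real audio) that differ in exactly one bit, and `bit_distance` is a helper name I introduced for illustration. It counts how many bits differ between the inputs and between their SHA-256 digests.

```python
import hashlib


def bit_distance(a: bytes, b: bytes) -> int:
    """Counts the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical inputs that differ in exactly one bit (assumption: stand-ins
# for two nearly identical audio streams).
sample_a = b"\x00" * 1024
sample_b = b"\x01" + b"\x00" * 1023

digest_a = hashlib.sha256(sample_a).digest()
digest_b = hashlib.sha256(sample_b).digest()

input_diff = bit_distance(sample_a, sample_b)    # bits changed in the input
digest_diff = bit_distance(digest_a, digest_b)   # bits changed in the digest

print(f"input bits changed:  {input_diff} of {len(sample_a) * 8}")
print(f"digest bits changed: {digest_diff} of {len(digest_a) * 8}")
```

A one-bit change in the input typically flips around half of the 256 digest bits, so the distance between two digests only tells you that the inputs differ, not by how much.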
However, there are many suitable acoustic fingerprinting algorithms that accept a segment of audio and return a fingerprint vector. Then, you can measure the similarity of two audio clips by seeing how close together their corresponding fingerprint vectors are.
Chromaprint is a popular open source acoustic fingerprinting algorithm with bindings and reimplementations in many popular languages. Chromaprint is used by the AcoustID project, which is building an open source database to collect fingerprints and metadata for popular music.
The researcher Joren Six has also written and open-sourced the acoustic fingerprinting libraries Panako and Olaf. However, both are currently licensed under AGPLv3 and might infringe upon still-active US patents.
Several companies--such as Pex--sell APIs for checking if arbitrary audio files contain copyrighted material. If you sign up for Pex, they will give you their closed-source SDK for generating acoustic fingerprints as per their algorithm.
Here, I will assume that you chose Chromaprint and that you want to compare fingerprints using Python, although the general principle applies to other fingerprinting libraries. You will have to install libchromaprint and an FFT library.
Two fingerprints can be compared by applying the xor function between the fingerprints and counting the number of 1 bits.
Here is some quick-and-dirty Python code for comparing the distance between two fingerprints. If I were building a production service, though, I'd implement the comparison in C++ or Rust.
from operator import xor
from typing import List
# These imports should be in your Python module path
# after installing the `pyacoustid` package from PyPI.
import acoustid
import chromaprint


def get_fingerprint(filename: str) -> List[int]:
    """
    Reads an audio file from the filesystem and returns a
    fingerprint.

    Args:
        filename: The filename of an audio file on the local
            filesystem to read.

    Returns:
        Returns a list of 32-bit integers. Two fingerprints can
        be roughly compared by counting the number of
        corresponding bits that are different from each other.
    """
    _, encoded = acoustid.fingerprint_file(filename)
    fingerprint, _ = chromaprint.decode_fingerprint(
        encoded
    )
    return fingerprint


def fingerprint_distance(
    f1: List[int],
    f2: List[int],
    fingerprint_len: int,
) -> float:
    """
    Returns a normalized distance between two fingerprints.

    Args:
        f1: The first fingerprint.
        f2: The second fingerprint.
        fingerprint_len: Only compare the first `fingerprint_len`
            integers in each fingerprint. This is useful
            when comparing audio samples of a different length.

    Returns:
        Returns a number between 0.0 and 1.0 representing
        the distance between two fingerprints, i.e. the
        fraction of compared bits that differ.
    """
    max_hamming_weight = 32 * fingerprint_len
    hamming_weight = sum(
        sum(
            c == "1"
            for c in bin(xor(f1[i], f2[i]))
        )
        for i in range(fingerprint_len)
    )
    return hamming_weight / max_hamming_weight

The above functions would let you compare two fingerprints as follows:
>>> f1 = get_fingerprint("1.mp3")
>>> f2 = get_fingerprint("2.mp3")
>>> f_len = min(len(f1), len(f2))
>>> fingerprint_distance(f1, f2, f_len)
0.35 # for example
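If you want to sanity-check the arithmetic without any audio files or Chromaprint installed, here is a small self-contained sketch of the same xor-and-count idea on hand-made fingerprints. The name `popcount_distance` and the sample values are mine, purely for illustration. The 32-bit mask is an extra precaution I added in case a binding hands back signed integers.

```python
from typing import List


def popcount_distance(f1: List[int], f2: List[int]) -> float:
    """Fraction of differing bits over the overlapping prefix of two fingerprints."""
    n = min(len(f1), len(f2))
    if n == 0:
        raise ValueError("both fingerprints must be non-empty")
    differing = sum(
        # Mask to 32 bits so negative (signed) values are counted correctly.
        bin((a ^ b) & 0xFFFFFFFF).count("1")
        for a, b in zip(f1[:n], f2[:n])
    )
    return differing / (32 * n)


# Made-up fingerprints, three 32-bit items each.
fp_a = [0x00000000, 0xFFFFFFFF, 0x0F0F0F0F]
fp_b = [0x00000000, 0xFFFFFFFE, 0xF0F0F0F0]

# Item 0 differs in 0 bits, item 1 in 1 bit, item 2 in all 32 bits:
# (0 + 1 + 32) / 96 = 0.34375
print(popcount_distance(fp_a, fp_a))  # 0.0
print(popcount_distance(fp_a, fp_b))  # 0.34375
```

A distance near 0.0 means the fingerprints are almost identical, while unrelated audio should land somewhere around 0.5, since about half of the bits would match by chance.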
You can read more about how to use Chromaprint to compute the distance between different audio files. This mailing list thread describes the theory of how to compare Chromaprint fingerprints. This GitHub Gist offers another implementation.
Upvotes: 6
Reputation: 180303
You cannot use a SHA256 hash for this. This is intentional: it would weaken the security of the hash if you could. What you suggest is akin to differential cryptanalysis. SHA256 is a modern cryptographic hash and is designed to be safe against such attacks.
Upvotes: 0