Dimitri
Dimitri

Reputation: 453

finding somewhat similar files

Here is my current design:

  1. User uploads a file
  2. Server checks sha256 of the file to see if it already exists
  3. If file does not already exist: rename file into its sha256 hash, store file's original name in database and sha256 of that file. otherwise Add another entry to database, without storing file on server.

The issue that I am running into is that there are many files which are extremely similar(differ by 0.1% or less), which is very insignificant for most videos and images. Most of those files are around 10MB, but some are larger. I am trying to find a better method of finding similar files, and eventually saving just the differences between similar files instead of having to store both.

I have tried a couple different methods, however all of them were either extremely picky, did not work on binary files, or could not process files larger than a couple kilobytes. As of right now I am considering making my own hashing method that will do something like this:

$a=strlen($f);
$p=$a/1000;
$hash='';

for($c=0; $c<1000; $c++)
{
$ll='';
for($i=0; $i<=$p; $i++)
{
$ll+=ord(substr($f, $c*$a/1000 + $i, 1));
}
$hash.=chr($ll%26 + 65); //at the end, this is going to be a 1000 character hash.
}

The code above does a great job for files that have the SAME SIZE, however it is practically useless for files with different size..

Hmm. Instead of trying to do anything with hashing, I am going to try examine file's properties... for now i am just going to do 'number of consecutive bytes with same values' or something like that. Update: that did not work as planned. By adding mere 10kb to a 100MB file the values change completely.

New approach: Distances between sequences of bytes.

Upvotes: 0

Views: 58

Answers (2)

Dimitri
Dimitri

Reputation: 453

After lots of different designs, here is what I ended up with:

$f=file_get_contents($l['sha1']);

$a=strlen($f);
$ff='';
for($xd=0; $xd<=$a; $xd++)
{
$ff.=chr(ord(substr($f, $xd, 1))%26 + 65);
}

$hash='';
$toFind='AAA';
$start = 0;
$pos=0;
while(($pos = strpos(($ff),$toFind,$start)) !== false) {
        $hash.=chr(abs($start-$pos)%93 + 32);
        $start = $pos+1; // start searching from next position.
}

$hash=substr($hash,0,19999);

Amazingly enough, this works great for files under 10MB, for files greater than 10MB '$toFind' string changes to 'AAAA'. It is able to find similar files and tell exactly where the differences are.

Upvotes: 0

Denis de Bernardy
Denis de Bernardy

Reputation: 78433

For text, computing the size of a diff is a relatively proxy of how similar or dissimilar two files are.

For images (and videos), it's a yet to be properly solved problem insofar as I'm aware, so the best I'd expect you'll get are rough ideas to play with.

I'd imagine it's possible to compute a reasonably OK proxy by analyzing an image's Fourier transform. Perhaps by normalizing it for amplitude or bandwidth in some ways, by eliminating low amplitude frequencies, or by sampling it, or perhaps all of those.

I'm afraid my signal processing skills are far too rusty to say precisely what or how. But it could be a lead if you've some familiarity with the maths involved, and there is a DSP Stack Exchange where you might be able to shop for up-to-date specifics if this is indeed the right approach:

http://dsp.stackexchange.com

Upvotes: 1

Related Questions