Reputation: 453
Here is my current design:
The issue that I am running into is that there are many files which are extremely similar(differ by 0.1% or less), which is very insignificant for most videos and images. Most of those files are around 10MB, but some are larger. I am trying to find a better method of finding similar files, and eventually saving just the differences between similar files instead of having to store both.
I have tried a couple different methods, however all of them were either extremely picky, did not work on binary files, or could not process files larger than a couple kilobytes. As of right now I am considering making my own hashing method that will do something like this:
$a=strlen($f);
$p=$a/1000;
$hash='';
for($c=0; $c<1000; $c++)
{
$ll='';
for($i=0; $i<=$p; $i++)
{
$ll+=ord(substr($f, $c*$a/1000 + $i, 1));
}
$hash.=chr($ll%26 + 65); //at the end, this is going to be a 1000 character hash.
}
The code above does a great job for files that have the SAME SIZE, however it is practically useless for files with different size..
Hmm. Instead of trying to do anything with hashing, I am going to try examine file's properties... for now i am just going to do 'number of consecutive bytes with same values' or something like that. Update: that did not work as planned. By adding mere 10kb to a 100MB file the values change completely.
New approach: Distances between sequences of bytes.
Upvotes: 0
Views: 58
Reputation: 453
After lots of different designs, here is what I ended up with:
$f=file_get_contents($l['sha1']);
$a=strlen($f);
$ff='';
for($xd=0; $xd<=$a; $xd++)
{
$ff.=chr(ord(substr($f, $xd, 1))%26 + 65);
}
$hash='';
$toFind='AAA';
$start = 0;
$pos=0;
while(($pos = strpos(($ff),$toFind,$start)) !== false) {
$hash.=chr(abs($start-$pos)%93 + 32);
$start = $pos+1; // start searching from next position.
}
$hash=substr($hash,0,19999);
Amazingly enough, this works great for files under 10MB, for files greater than 10MB '$toFind' string changes to 'AAAA'. It is able to find similar files and tell exactly where the differences are.
Upvotes: 0
Reputation: 78433
For text, computing the size of a diff is a relatively proxy of how similar or dissimilar two files are.
For images (and videos), it's a yet to be properly solved problem insofar as I'm aware, so the best I'd expect you'll get are rough ideas to play with.
I'd imagine it's possible to compute a reasonably OK proxy by analyzing an image's Fourier transform. Perhaps by normalizing it for amplitude or bandwidth in some ways, by eliminating low amplitude frequencies, or by sampling it, or perhaps all of those.
I'm afraid my signal processing skills are far too rusty to say precisely what or how. But it could be a lead if you've some familiarity with the maths involved, and there is a DSP Stack Exchange where you might be able to shop for up-to-date specifics if this is indeed the right approach:
Upvotes: 1