Reputation: 36937

Uniquely identify Files and Directories on a Server for Comparison

What would be the best way to compare files and/or directories? Let's say I want to store files on a server or a collection of servers, like a cloud-based system. My users collaborate with one another in many cases, though not all. Either way, I can have upwards of a hundred people with the exact same file. The only difference is that they likely renamed it; the data itself is identical. There is also no specific file type: there are pdf, doc, docx, txt, video, audio files, etc., but the issue boils down to the same files being stored over and over. What I want to do is cut that down: remove the hundreds of dupes and, with the help of a database, store things like the file name each user provided, so I can keep the single remaining file how and where I want while still showing users the names they used.

Now I know I can do something with md5, sha1, sha2, or something equivalent that will give me an essentially unique value I can use for such comparisons. But I am not sure how or where to begin. For example, how can I get the sha or md5 of a file with PHP? When I look those up, I find functions for strings but not for files.

Overall, I am looking to bounce ideas around to figure this out, not so much for a direct solution. Any help would be great.

Upvotes: 2

Answers (6)

Paul

Reputation: 141827

You can generate a hash for a file with:

md5(file_get_contents($filename));

Keep in mind that two entirely different files can produce the exact same md5 hash (the other hash functions have the same problem, although a stronger one than md5 makes collisions far less likely). To be certain that two files are identical you need to compare them byte by byte, but you don't want to read every byte of every file on the hard disk to find a match.

What you need to do is store the hash of every file in your database, in a column that is also indexed.

Then you can select all files with the same hash as the new file from your database. That gives you a small list of candidates. Say you have 100,000 files on disk: you might get a few files that match the hash, and most of the time the list will be short. Then you can compare the new file against each candidate byte by byte to see whether it's a match. Searching through the ~10 files that have the same hash saves you from searching through all 100,000 files, but you still need to do the byte-by-byte comparison, because those 10 files could all be very different.
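
The byte-by-byte check on the short candidate list could look something like the sketch below. The helper name `filesAreIdentical` and the chunk size are my own assumptions, not anything built into PHP; it reads both files in chunks so that large videos never have to fit in memory.

```php
<?php
// Hypothetical helper: true only if both files hold exactly the same bytes.
function filesAreIdentical(string $pathA, string $pathB, int $chunkSize = 8192): bool
{
    clearstatcache();
    // Files of different sizes can never match, so skip reading them.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }

    $a = fopen($pathA, 'rb');
    $b = fopen($pathB, 'rb');
    if ($a === false || $b === false) {
        if ($a !== false) { fclose($a); }
        if ($b !== false) { fclose($b); }
        throw new RuntimeException('Unable to open one of the files.');
    }

    try {
        // Compare chunk by chunk and stop at the first difference.
        while (!feof($a)) {
            if (fread($a, $chunkSize) !== fread($b, $chunkSize)) {
                return false;
            }
        }
        return true;
    } finally {
        fclose($a);
        fclose($b);
    }
}
```

You would only call this for the rows your hash lookup returned, so in practice it runs against a handful of files at most.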

Upvotes: 1

Yanick Rochon

Reputation: 53546

There are many ways you can build such a system, but if I had to write one from scratch, this is most likely how I would do it:

Have three database tables (in pseudocode):
```
table users {
   id integer         ## PK
   username string
   password string    ## sha1
   ...
}

table user_files {
   user_id integer    ## Composite INDEX
   file_id integer    ## 
   filename string
}

table files {
   id integer           ## PK
   uniq_id string       ## basically 'yyyyMMddhhmmssRRRR' INDEX
   sha_hash string      ## sha1
   md5_hash string      ## md5
}
```
Where files.sha_hash is the result of computing the sha1 of the file, files.md5_hash is the result of computing the md5 of the same file as a second check, and user_files.filename is the file name the user gave it. On the server, the file is stored under the name files.uniq_id to make sure there is no name collision; the last RRRR characters are a random string (regenerate RRRR until uniq_id is unique in the database).

Note: PHP provides sha1_file and md5_file. Use these when hashing files!
When a user stores a file, compute its hashes (as described above) and save it under its files.uniq_id. To avoid having too many files in the same folder on the server, you may decompose files.uniq_id and distribute the files into yyyy/MM subfolders.

Next, associate user_files.file_id = files.id and user_files.user_id = users.id, and set user_files.filename to the uploaded file name.
If a user uploads another file, hash it the same way, but check whether the result matches an existing files.sha_hash and files.md5_hash pair. If there is a match, it doesn't matter what name the file has: it is very likely the exact same file, so skip storing it again, associate user_files.file_id with the files.id that was found and user_files.user_id = users.id, and set user_files.filename to the uploaded file name.

Note: this results in 1 physical file and 2 virtual files on your server.
If a user renames a file without modifying it, simply update the matching user_files.filename.
If a user deletes a file, count how many user_files rows reference that file_id. Only if there is exactly one, delete the physical file and the files entry; otherwise, just remove the user_files association.
If a user modifies a file, with or without renaming it, perform a delete followed by a new save, as described above.
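
To make the store and delete rules above a bit more concrete, here is a rough sketch that uses plain PHP arrays in place of the files and user_files tables. The class name `FileRegistry` and its methods are made up for illustration; a real version would run the equivalent queries against your database and also move or unlink the physical file.

```php
<?php
// In-memory stand-in for the `files` and `user_files` tables (illustration only).
final class FileRegistry
{
    /** @var array<string, int> "sha1:md5" key => file id */
    private array $fileIdByHash = [];
    /** @var array<int, array<string, int>> user id => [filename => file id] */
    private array $userFiles = [];
    private int $nextId = 1;

    // Registers an upload; returns [file id, true if these bytes were not stored before].
    public function store(int $userId, string $path, string $filename): array
    {
        $key = sha1_file($path) . ':' . md5_file($path);
        $isNew = !isset($this->fileIdByHash[$key]);
        if ($isNew) {
            $this->fileIdByHash[$key] = $this->nextId++;
        }
        $fileId = $this->fileIdByHash[$key];
        $this->userFiles[$userId][$filename] = $fileId;
        return [$fileId, $isNew];
    }

    // Removes one user's link; returns true when nobody references the file
    // any more, meaning the physical file can be deleted as well.
    public function delete(int $userId, string $filename): bool
    {
        $fileId = $this->userFiles[$userId][$filename] ?? null;
        if ($fileId === null) {
            return false;
        }
        unset($this->userFiles[$userId][$filename]);
        foreach ($this->userFiles as $links) {
            if (in_array($fileId, $links, true)) {
                return false; // another user still has this file
            }
        }
        unset($this->fileIdByHash[array_search($fileId, $this->fileIdByHash, true)]);
        return true;
    }
}
```

The point of the sketch is only the bookkeeping: a second upload of the same bytes adds a link rather than a file, and a delete removes the physical file only when the last link goes away.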

Upvotes: 1

Mr Coder

Reputation: 8186

$filePath = '/var/www/site/public/uploads/foo.txt';
$data = file_get_contents($filePath);

$key = sha1($data);   // or: $key = sha1_file($filePath);

Save this $key in a table column and mark that column as UNIQUE, so that no two copies of the same file can be stored.

Use sha1 instead of md5; many version control systems, such as git, use sha1 hashes to identify file contents.
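
Since the question mentions videos, one thing to watch is that file_get_contents pulls the whole file into memory before hashing, whereas sha1_file and hash_file take the path and do the reading for you. A small sketch, where `contentKey` is just a name I picked and sha256 is one possible choice if you want a stronger hash than sha1:

```php
<?php
// Hypothetical helper: hashes a file by path instead of loading it into a string first.
function contentKey(string $path, string $algo = 'sha256'): string
{
    $hash = hash_file($algo, $path);
    if ($hash === false) {
        throw new RuntimeException("Cannot hash $path");
    }
    return $hash;
}
```

With 'sha1' as the algorithm this gives the same value as sha1_file($path), so you can switch algorithms later by changing one argument (and widening the column).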

Upvotes: 3

Zack Bloom

Reputation: 8417

When a file is uploaded:

Compute the hash (SHA1, etc.)
Rename the file to that hash and store it (unless a file with that hash already exists [you already have it])
Store the hash in your database.

When a file is requested:

Get the hash from your database
Return the file based on the hash
Use HTTP headers so the user's browser provides it to them with the filename they used
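
A minimal sketch of both halves, assuming a flat storage directory and SHA1 as the hash. The function names `storeByHash` and `downloadHeaders` are hypothetical, and for a real HTTP upload you would probably use move_uploaded_file rather than copy.

```php
<?php
// Hypothetical helper: stores a file under its SHA-1 and returns that hash.
function storeByHash(string $sourcePath, string $storageDir): string
{
    $hash = sha1_file($sourcePath);
    if ($hash === false) {
        throw new RuntimeException("Cannot hash $sourcePath");
    }
    $target = $storageDir . DIRECTORY_SEPARATOR . $hash;
    // If a file with this hash already exists, we already have the content.
    if (!file_exists($target) && !copy($sourcePath, $target)) {
        throw new RuntimeException("Cannot store $sourcePath");
    }
    return $hash;
}

// Hypothetical helper: headers that make the browser save the download
// under the name the user originally gave the file.
function downloadHeaders(string $userFilename, int $size): array
{
    return [
        'Content-Type: application/octet-stream',
        'Content-Length: ' . $size,
        "Content-Disposition: attachment; filename*=UTF-8''" . rawurlencode($userFilename),
    ];
}
```

When serving a request you would look up the hash and the user's filename in the database, send each string from downloadHeaders() with header(), and then output the stored file, for example with readfile().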

Upvotes: 2

Chris Carson

Reputation: 1845

To get the md5 hash of a file at $path...

$hash = md5(file_get_contents($path));

Hope this partially answers your question.

Upvotes: 1

Ed Heal

Reputation: 60007

Is it necessary? Hard disk space is very cheap these days, so who cares about the duplicates? I would imagine the files are not that big?
MD5 et al. are not unique; they are just a quick way of telling that two files are not the same. It is possible for two files to have the same MD5 value but contain different data.

Upvotes: -3
