Reputation: 36937
What would be the best way to compare files and/or directories? Let's say I want to store files on a server or a collection of servers, like a cloud-based system. My users collaborate with one another in many cases, and in some cases not. Either way, I can have upwards of a hundred people with the exact same file. The only difference is that they have likely renamed it; the data is identical. There is also no specific file type: there are pdf, doc, docx, txt, video, audio files, etc. The issue boils down to the same files being stored over and over. What I want to do is cut that down: remove the hundreds of dupes and, with the help of a database, store things like the file name each user provided, so that I can store the single remaining file how and where I want while still showing users the names they used.
Now, I know I can do something with md5, sha1, sha2 or something equivalent that will give me a practically unique value I can use for such comparisons. But I am not exactly sure how or where to begin with that. For example, how can I get the sha or md5 of a file with PHP? When I look those up I find functions for strings, but not for files.
Overall I am looking to bounce ideas around to figure this out, not so much for a direct solution. Any help would be great.
Upvotes: 2
Views: 709
Reputation: 141827
You can use:
md5(file_get_contents($filename));
to generate a hash for a file.
Keep in mind that two entirely different files can produce the exact same md5 hash (the same is true of other hashes, although a better hash function than md5 gives far fewer collisions). To be certain that two files are identical you need to compare them byte by byte, but you don't want to analyze every byte of every file on the hard disk to find a match.
What you need to do is store the hash of every file in your database, in a column that is also indexed.
Then you can select all files with the same hash as the new file from your database. That will give you a small list of files. Say you have 100,000 files on the disc. You might get a list of a few files that match the hash. Most of the time the lists will be short. Then you can loop through those files byte by byte to see if it's a match. Searching through a list of the ~10 files that have the same hash will save you from searching through all 100,000 files, but you still need to do the byte by byte comparison, because those 10 files could all be very different.
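The byte-by-byte check on the short candidate list could be sketched as follows. This is only an illustration: filesAreIdentical and findIdenticalFile are hypothetical helper names, and the candidate paths are assumed to come from the indexed hash lookup described above (the query itself is not shown).

```php
<?php
// Hypothetical helper: true if both files contain exactly the same bytes.
function filesAreIdentical(string $pathA, string $pathB, int $chunkSize = 8192): bool
{
    // Files of different sizes can never match, so skip reading them.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    $a = fopen($pathA, 'rb');
    $b = fopen($pathB, 'rb');
    if ($a === false || $b === false) {
        throw new RuntimeException('Could not open one of the files');
    }
    try {
        // Read both files in chunks so large files are never fully in memory.
        while (!feof($a)) {
            if (fread($a, $chunkSize) !== fread($b, $chunkSize)) {
                return false;
            }
        }
        return true;
    } finally {
        fclose($a);
        fclose($b);
    }
}

// Hypothetical helper: $candidatePaths are the stored files whose hash
// matched the new file's hash. Returns the first true duplicate, or null.
function findIdenticalFile(string $newPath, array $candidatePaths): ?string
{
    foreach ($candidatePaths as $candidate) {
        if (filesAreIdentical($newPath, $candidate)) {
            return $candidate;
        }
    }
    return null;
}
```

If findIdenticalFile returns a path, the upload can be linked to that existing file; if it returns null, the new file has to be stored even though its hash collided.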
Upvotes: 1
Reputation: 53546
There are many ways you can accomplish such a system. But if I'd have to write one from scratch, this is most likely how I would do it :
1. Have three database tables (in pseudocode):

table users {
    id        integer   ## PK
    username  string
    password  string    ## sha1
    ...
}
table user_files {
    user_id   integer   ## Composite INDEX
    file_id   integer   ##
    filename  string
}
table files {
    id        integer   ## PK
    uniq_id   string    ## basically 'yyyyMMddhhmmssRRRR' INDEX
    sha_hash  string    ## sha1
    md5_hash  string    ## md5
}

Where files.sha_hash is the result of computing the sha1 of the file, files.md5_hash is the result of computing the md5 of the same file (as a double security check), and user_files.filename is the actual file name the user gave. On the server, the file would be stored under the name files.uniq_id to make sure there is no name collision, where the last RRRR chars represent a random string (regenerate RRRR until uniq_id is unique in the database).

Note: PHP provides sha1_file and md5_file. Use these when hashing files!

2. When a user stores a file, process the file (as described in step 1) and save it appropriately. To avoid having too many files in the same folder on the server, you may decompose files.uniq_id and separate the files into yyyy/MM sub folders. Next, associate user_files.file_id = files.id and user_files.user_id = users.id, and set user_files.filename to the uploaded file name (see next step).

3. If a user uploads another file, process it as in step 2, but check whether the result matches an existing files.sha_hash and files.md5_hash. If there is a match, it doesn't matter what name the file has: it is very likely the exact same file, so associate user_files.file_id with the found files.id, set user_files.user_id = users.id, and set user_files.filename to the uploaded file name.

Note: this results in 1 physical file and 2 virtual files on your server.

4. If a user renames a file without modifying it, simply update the user_files.filename of the entry he/she wants to rename.

5. If a user deletes a file, check how many user_files.file_id entries match, and only if exactly 1 match is found, delete the physical file and the files entry. Otherwise, simply remove the user_files association.

6. If a user modifies a file, with or without renaming it, perform a delete (step 5) and another save (step 3).
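The upload and delete steps above can be sketched with plain PHP arrays standing in for the files and user_files tables. Everything here is an assumption made for illustration: the FileStore class, its method names, and the in-memory storage are hypothetical, and a real system would use database queries and move the physical file to its uniq_id path.

```php
<?php
// Hypothetical in-memory model of the files and user_files tables.
final class FileStore
{
    /** @var array<int, array{sha1: string, md5: string}> rows keyed by files.id */
    public array $files = [];
    /** @var array<int, array{user_id: int, file_id: int, filename: string}> */
    public array $userFiles = [];
    private int $nextId = 1;

    // Steps 2 and 3: hash the upload, then reuse a matching files row or add one.
    public function upload(int $userId, string $path, string $filename): int
    {
        $sha1 = sha1_file($path);
        $md5 = md5_file($path);
        $fileId = null;
        foreach ($this->files as $id => $row) {
            if ($row['sha1'] === $sha1 && $row['md5'] === $md5) {
                $fileId = $id; // same content already stored
                break;
            }
        }
        if ($fileId === null) {
            $fileId = $this->nextId++;
            $this->files[$fileId] = ['sha1' => $sha1, 'md5' => $md5];
            // A real system would move the upload to its uniq_id path here.
        }
        $this->userFiles[] = [
            'user_id' => $userId, 'file_id' => $fileId, 'filename' => $filename,
        ];
        return $fileId;
    }

    // Step 5: drop the association; returns true when no other user still
    // references the file, meaning the physical file should be deleted too.
    public function delete(int $userId, string $filename): bool
    {
        foreach ($this->userFiles as $i => $row) {
            if ($row['user_id'] !== $userId || $row['filename'] !== $filename) {
                continue;
            }
            unset($this->userFiles[$i]);
            foreach ($this->userFiles as $other) {
                if ($other['file_id'] === $row['file_id']) {
                    return false; // still referenced by someone else
                }
            }
            unset($this->files[$row['file_id']]);
            return true;
        }
        return false; // nothing matched, nothing removed
    }
}
```

In this model a rename (step 4) only changes the filename field of one userFiles entry, and a modification (step 6) is a delete followed by an upload.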
Upvotes: 1
Reputation: 8186
$filePath = '/var/www/site/public/uploads/foo.txt';
$data = file_get_contents($filePath);
$key = sha1($data); //or $key = sha1_file($filePath);
Save this $key in a column of a table, and mark that column as UNIQUE so that no two identical files can be stored.
Use sha1 instead of md5, since many version control systems such as git use sha1 hashes to identify the uniqueness of a file.
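As a small variation on the snippet above: file_get_contents loads the whole file into memory, which may be a problem for large videos, whereas sha1_file and hash_file compute the digest directly from the path. The sketch below wraps PHP's built-in hash_file; the fileKey name and the choice to throw on failure are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: hex digest of a file, computed from the path
// rather than from a string holding the whole file.
function fileKey(string $path, string $algo = 'sha1'): string
{
    // hash_file() accepts any algorithm name listed by hash_algos().
    $hash = hash_file($algo, $path);
    if ($hash === false) {
        throw new RuntimeException("Could not hash $path");
    }
    return $hash;
}
```

With the default algorithm, fileKey($filePath) gives the same value as sha1_file($filePath); passing 'sha256' as the second argument switches to a sha2 digest, in which case the key column needs room for 64 hex characters instead of 40.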
Upvotes: 3
Reputation: 8417
When a file is uploaded:
When a file is requested:
Upvotes: 2
Reputation: 1845
To get the md5 hash of a file at $path
...
$hash = md5(file_get_contents($path));
Hope this partially answers your question.
Upvotes: 1
Reputation: 60007
Upvotes: -3