Reputation: 2250
I am working with a large number of very large files (e.g. 1000 x 512 MB), and I speed things up by writing certain information into databases that can be reused when I re-run the software. For this reason I need to be able to generate a unique filename for an arbitrary subset of these files.
I have tried generating filenames from the total file size of the subset, the file modification dates, and combinations of the two. The problem is that many files share the same size and the same modification date, which makes my current identifier string ambiguous. All that matters is that the identifier is always the same for the same list of files, so that I can always access the correct database file for a given set of files. Any ideas are greatly appreciated!
Here is what I use at the moment, which does not work...
import os
import glob
import datetime

file_paths = glob.glob("path/to/files/*.foo")

def modification_date(file_path):
    return datetime.datetime.fromtimestamp(os.path.getmtime(file_path))

# one modification timestamp per file
uid = [modification_date(f) for f in file_paths]
# collapse each timestamp into a sum of its date/time components
uid = [d.year + d.month + d.day + d.hour + d.minute + d.second for d in uid]
# average of the date sums plus the total size of all files in the subset
uid = sum(uid) // len(uid) + sum(os.path.getsize(f) for f in file_paths)
Upvotes: 1
Views: 1779
Reputation: 22443
This may well be wildly off, but as long as the files included in the list don't change, could you get what you are after with hashlib?
>>> import hashlib, glob
>>> file_paths = glob.glob("/home/rolf/StackOverflow/*.py")
>>> hash_object = hashlib.sha256(str(file_paths))
>>> file_name = hash_object.hexdigest()
>>> file_name
'f69dd9583eabae55319e9a56dbfc737cc16ab58634d9042f4c530e9a85e7d77c'
>>> file_paths = glob.glob("/home/rolf/Research/*.py")
>>> hash_object = hashlib.sha256(str(file_paths))
>>> file_name = hash_object.hexdigest()
>>> file_name
'dc4beec8e859b190d966068a52674d135f22520514ac1020b3e11a4af090e579'
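One caveat, in case it matters for your use: glob.glob does not promise any particular ordering, so the same set of files could in principle hash to a different digest between runs. A minimal sketch, assuming you want the identifier to depend only on which files are in the list and not on listing order, is to sort the paths before hashing:

import hashlib, glob

# sort so the digest depends only on which files are present, not on listing order
file_paths = sorted(glob.glob("/home/rolf/StackOverflow/*.py"))
file_name = hashlib.sha256(str(file_paths)).hexdigest()
# (on Python 3 the argument must be bytes, e.g. hashlib.sha256(str(file_paths).encode()))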
Upvotes: 1
Reputation: 76194
If the master list of all files never changes, then you can uniquely encode any subset of that list into a string of about len(master_list)/8 characters. First, you create a bit array where a 0 at index i means "the ith element of the master list is not part of this subset" and a 1 means "the ith element of the master list is part of this subset". Then you convert the bit array into an integer, which you can then pack into a string.
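To make the mapping concrete, here is a tiny illustration with a made-up four-file master list (the names are hypothetical, not part of the code below):

master = ['a.foo', 'b.foo', 'c.foo', 'd.foo']
subset = ['b.foo', 'd.foo']
bits = [int(f in subset) for f in master]               # [0, 1, 0, 1]
value = sum(bit * 2**i for i, bit in enumerate(bits))   # 2 + 8 = 10

The implementation below does exactly this and then packs the integer into a filename-safe string.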
import struct
import base64

#encodes a positive integer of any size into a string.
def string_from_int(value):
    bytes = []
    while value > 0:
        bytes.append(value % 2**8)
        value >>= 8
    return struct.pack("i"*len(bytes), *bytes)

#decodes a string into a positive integer. Only works on strings whose length is divisible by 4.
def int_from_string(s):
    bytes = struct.unpack("i" * (len(s)/4), s)
    return sum(byte * (256**i) for i, byte in enumerate(bytes))

#encodes a subset of a list into a filename-safe string.
def encode(master_file_list, subset):
    bitmask = [int(filename in subset) for filename in master_file_list]
    int_value = sum(bit * 2**i for i, bit in enumerate(bitmask))
    string_value = string_from_int(int_value)
    #the string may contain slashes or other characters illegal in filenames,
    #so we base64-encode it, at the expense of increasing the length by a third or so
    return base64.urlsafe_b64encode(string_value)

#the reverse of `encode`.
def decode(master_file_list, filename):
    string_value = base64.urlsafe_b64decode(filename)
    int_value = int_from_string(string_value)
    subset = [value for i, value in enumerate(master_file_list) if (2**i) & int_value]
    return subset
master_file_list = ['caddy.jpg', 'fjeld.dat', 'gyve.ini', 'karts.png', 'laves.txt', 'nimbi.jpg', 'ocas.ini', 'sipes.dat', 'wacky.png', 'waff.png']
subset = ["fjeld.dat", "laves.txt", "ocas.ini"]
filename = encode(master_file_list, subset)
print "Subset is:", subset
print "Filename is:", filename
print "Filename decodes back to:", decode(master_file_list, filename)
Result:
Subset is: ['fjeld.dat', 'laves.txt', 'ocas.ini']
Filename is: UgAAAA==
Filename decodes back to: ['fjeld.dat', 'laves.txt', 'ocas.ini']
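Since encode only tests membership with filename in subset, the resulting name does not depend on the order in which the subset is passed, which matches the requirement that the same collection of files always gets the same identifier. A quick check, reusing the variables from the example above:

assert encode(master_file_list, ["ocas.ini", "fjeld.dat", "laves.txt"]) == filename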
Upvotes: 0