Reputation: 21145
I am working on a project that basically monitors a set remote directories (FTP, networked paths, and another), if the file is considered new and meets criteria we download it and process it. However i am stuck on what the best way is to keep track of the files we already downloaded. I don't want to download any duplicate files, so i need to keep track of what is already downloaded.
Orignally i was storing it as a tree:
server->directory->file_name
When the service shuts down it writes it to a file, and rereads it back when it starts up. However given that when there is around 20,000 or so files in the tree stuff starts to slow down alot.
Is there a better way to do this?
EDIT
The lookup times start to slowdown alot, my basic implementation is a dict of a dict. The storing stuff on the disk is fine, its more or less just the lookup time. I know i can optimize the tree and partition it. However that seems excessive for such a small project i was hoping python would have something like that.
Upvotes: 0
Views: 128
Reputation: 375584
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name)
, or even just (server, full_file_name_including_directory)
. There's no need for a multiple-level data structure. The tuples will hash into the set, and give you a O(1) lookup.
You mention "stuff starts to slow down alot," but you don't say if it's reading and writing time, or lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern()
the server names. This way, each server name will be stored only once in memory.
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.
Upvotes: 1