Reputation: 147
I'm writing a disk space usage program in Python and I already have the functionality I want, but it is slow. The filesystems I will be analyzing can be hundreds of GBs, with thousands of files in many depth-heavy folders.
I am showing the data with a Tree Map based off of the "Split" layout in the paper linked ahead. The creation and solving of the layout are both very fast operations. http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2005/rapporter05/engdahl_bjorn_05033.pdf
I am walking the given path with os.walk and adding the folders and files to this Tree Map along with their sizes. I get the size of every file and that is stored in a dictionary cache (where cache[filePath] = size) so I can easily retrieve it again. All of this is quick, except for os.walk. Running os.walk alone can take 30+ seconds, sometimes minutes.
I understand that I can't make it walk the structure any faster, but I would like to cache the results somehow so that in the future it is much faster. This is because the application allows navigation the tree map, where you can click on any section (which is a folder) and it will make that the 'root' of the tree map.
So, I am in need of a caching solution that would allow easy access to any file/folder, along with easy navigation in the hierarchy, so that if I started on the 'root' node, I could jump down to any specified child in any depth, and then from there I could move up (or down) in the structure.
I would rather not incorporate the data structure and navigation into the tree map. It would be best if the solution was in the walking and sizing part of the program. In the end, I really only need a walkable recreation of the file/folder structure with their sizes.
Any good libraries for this kind of structure? Or how easy would it be to write this myself? I haven't used a structure like this before, so I don't know the best way to create it so that I have the type of access I need.
Upvotes: 3
Views: 906
Reputation: 10397
Have you looked into Redis? Its fast, and plays well with Python. Also, what about multiple threads/processes that are initiated at a fork to do the searching faster?
Upvotes: 1