Reputation: 17231
I am parsing a folder structure that is quite heavy in terms of the number of folders and files. I have to go through all the folders and parse any files I come across. The files themselves are small (1000-2000 characters, although a few are bigger). I have two options:
- Go through all the folders and files and parse any that I come across in one big recursive loop.
- Go through all the folders and store the paths of all the files that I come across. Then, in another loop, parse the files by referring to the stored file paths.
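As a rough Python sketch of what I mean (parse_file stands in for my actual parsing code):

    import os

    def option_one(root):
        # Option 1: parse each file as soon as it is encountered.
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                parse_file(os.path.join(dirpath, name))

    def option_two(root):
        # Option 2, first pass: collect every file path.
        paths = [os.path.join(dirpath, name)
                 for dirpath, _, filenames in os.walk(root)
                 for name in filenames]
        # Option 2, second pass: parse the stored paths.
        for path in paths:
            parse_file(path)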
Which option would be better and maybe faster (the work will most likely be I/O bound, so the choice probably won't make much difference, but I thought I'd ask anyway)?
Upvotes: 2
Views: 133
Reputation: 52169
"The most readable and the most understandable" almost always means "the simplest and the easiest way." (Although some code is inherently complex. That's still not an excuse to write unreadable code.) Option 1 sounds easier to implement in my opinion, but try it for yourself. Profile for bottlenecks if it isn't fast enough.
Most likely, the actual disk I/O will take much longer than the total processor cycles or memory accesses needed for either option, so which option you take might not even be relevant. But the only way to know for sure how fast your program runs, and whether it needs improvement, is by profiling.
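For instance, in Python you could profile both variants over the same tree and compare where the time goes (option_one and option_two are hypothetical stand-ins for your two implementations):

    import cProfile

    # Sort by cumulative time to see whether I/O calls or parsing dominate.
    cProfile.run('option_one("/path/to/root")', sort='cumulative')
    cProfile.run('option_two("/path/to/root")', sort='cumulative')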
Upvotes: 7
Reputation: 48277
It depends a lot on how deep the folder structure will be and how much data you'll have to hold in memory (including number of files/filenames).
If you have an extremely deep structure, option 1 could run into a stack overflow, although given typical path length limits that's not very likely. With option 2, you will have to hold all the file names in memory, which could be a pain but probably won't actually be a problem.
Assuming the functions are reasonably simple, it will likely be easiest to call the recursive search function for each directory you find and the file parser for each valid file, all in a single loop. In Python (with parse_file standing in for your parser):
    import os

    def search_folder(curdir):
        # Parse files as they are found; recurse into subfolders.
        for name in os.listdir(curdir):
            item = os.path.join(curdir, name)
            if os.path.isfile(item):
                parse_file(item)
            elif os.path.isdir(item):
                search_folder(item)
That gives you a relatively simple and very readable structure, at the cost of potentially deep recursion. Caching file names and going through them later involves more code, will likely be less readable, and (assuming you handle directories the same way) has the same amount of recursion.
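That said, the caching variant can trade the recursion for an explicit stack; a minimal sketch for comparison, with parse_file again standing in for the parser:

    import os

    def collect_files(root):
        # Walk the tree iteratively, storing file paths instead of parsing them.
        paths, stack = [], [root]
        while stack:
            curdir = stack.pop()
            for name in os.listdir(curdir):
                item = os.path.join(curdir, name)
                if os.path.isfile(item):
                    paths.append(item)
                elif os.path.isdir(item):
                    stack.append(item)
        return paths

    # Second loop: parse the stored paths.
    for path in collect_files("."):
        parse_file(path)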
I'd go with #1, since it seems the more flexible and elegant solution.
Upvotes: 0
Reputation: 20272
The options seem to be functionally identical. I would say the main considerations should be readability and maintainability: which version is easier to understand and, later on, to change, extend, or fix bugs in.
It is also worth considering breaking the functionality into separate objects: one performs the search while the other parses the files found. Then you can run them concurrently and achieve better CPU utilization.
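A minimal sketch of that split, assuming Python's threading and queue modules (parse_file is again a stand-in; note that in CPython threads mainly help overlap I/O waits, not CPU-heavy parsing):

    import os
    import queue
    import threading

    q = queue.Queue()

    def searcher(root):
        # Producer: walk the tree and enqueue every file path found.
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                q.put(os.path.join(dirpath, name))
        q.put(None)  # sentinel: no more files are coming

    def parser():
        # Consumer: parse paths as they arrive, overlapping with the search.
        while True:
            path = q.get()
            if path is None:
                break
            parse_file(path)

    search = threading.Thread(target=searcher, args=(".",))
    parse = threading.Thread(target=parser)
    search.start(); parse.start()
    search.join(); parse.join()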
Upvotes: 0
Reputation: 18143
How about one thread that creates the list of file names to process, and another thread that reads through that list of files and uses one of a handful of worker threads to do the processing?
I don't know how many directories there are, but I'm guessing that's not the big time sink. I'd say you'd get the best performance by having a thread pool, with each thread in the pool parsing a file (once you have the list of them). Because that work is going to be so I/O bound, the threading will probably make things far more efficient.
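A sketch of that layout with Python's concurrent.futures (the worker count is a guess to tune, and parse_file is a stand-in for the real parser):

    import os
    from concurrent.futures import ThreadPoolExecutor

    # Build the list of files first, then hand it to a small pool of workers.
    files = [os.path.join(dirpath, name)
             for dirpath, _, filenames in os.walk(".")
             for name in filenames]

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Consume the iterator so any exceptions from parse_file surface here.
        list(pool.map(parse_file, files))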
Upvotes: 2