Reputation: 65
Basically, I am currently doing the following:
for bigLogFile in bigLogFileFolder:
    with open(bigLogFile) as bigLog:
        processBigLogFile(bigLog)
Since I am loading these log files from a network drive, most of the execution time is spent waiting for each file to load. However, the execution time of processBigLogFile is not trivial either.
So my basic idea was to make the process asynchronous, allowing the program to load the next log file while the current log is being processed. Seems simple enough, but I have no experience whatsoever with asynchronous programming and asyncio seems to offer a lot of different ways to achieve what I want to do (Using Task or Future seemed to be the likely candidates).
Can anybody show me the easiest way to achieve this? asyncio is not strictly necessary, but I would strongly prefer using a built-in library.
It should be noted that the log files have to be processed sequentially, so I can't simply parallelize loading and processing files.
Upvotes: 1
Views: 734
Reputation: 39374
It sounds like you want the file opening to be parallelised, but the processing to be sequential. I'm not sure whether that will save you any time.
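from concurrent.futures import ThreadPoolExecutor

bigLogFileFolder = [...]
num = len(bigLogFileFolder)
pool = ThreadPoolExecutor(num)
# Open all files in parallel on worker threads.
futures = [pool.submit(open, bigLogFile) for bigLogFile in bigLogFileFolder]
# Iterate in submission order so the logs are processed sequentially;
# as_completed() would yield them in completion order instead.
for future in futures:
    processBigLogFile(future.result())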
Upvotes: 0
Reputation: 56477
There's no need for complicated asynchronous code when the same can be achieved with a simple ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as tp:
    for bigLogFile in bigLogFileFolder:
        with open(bigLogFile) as bigLog:
            data = bigLog.read()
        tp.submit(process_data, data)
Since the ThreadPoolExecutor uses a queue under the hood, the order of processing will be preserved as long as max_workers=1.
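The ordering claim can be checked with a small experiment. This is only a sketch: process_data here is a hypothetical stand-in that sleeps for a different amount of time per item, and the integers stand in for file contents. With a single worker, the tasks should still run in the order they were submitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor

processed = []  # records the order in which tasks actually ran


def process_data(data):
    # Hypothetical stand-in for the real processing step.
    # Earlier items sleep longer, so any reordering would be visible.
    time.sleep(0.01 * (5 - data))
    processed.append(data)


with ThreadPoolExecutor(max_workers=1) as tp:
    for item in range(5):
        tp.submit(process_data, item)
# Leaving the with block waits for all submitted tasks to finish.

print(processed)  # [0, 1, 2, 3, 4]
```

With more than one worker the tasks may run concurrently, and the order in which they start or finish is no longer guaranteed.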
This works fine as long as you have enough memory to hold all or most of the files at once. If you are memory-bound, you have to wait for the ThreadPoolExecutor to finish some tasks before reading more files.
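If memory is a concern, one possible way to do that waiting is a semaphore that caps how many files are held in memory at a time. This is a rough sketch, not the only option: process_in_order, run, and max_pending are names made up for the illustration, and process_data is whatever function you already use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_in_order(paths, process_data, max_pending=2):
    """Read files in the calling thread and process them in order on one
    worker thread, holding at most max_pending files in memory at a time."""
    slots = threading.BoundedSemaphore(max_pending)

    def run(data):
        try:
            return process_data(data)
        finally:
            slots.release()  # free a slot even if processing raised

    futures = []
    with ThreadPoolExecutor(max_workers=1) as tp:
        for path in paths:
            # Blocks while max_pending files are already read or in progress.
            slots.acquire()
            with open(path) as f:
                data = f.read()
            futures.append(tp.submit(run, data))

    # result() re-raises any exception from the worker thread.
    return [future.result() for future in futures]
```

With max_pending=2, the next file is being read while the current one is processed, which is the overlap asked for in the question, and no further file is loaded until a slot frees up.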
Upvotes: 1