user2983738

Reputation: 65

How to use asyncio for basic file IO

Basically, I am currently doing the following:

for bigLogFile in bigLogFileFolder:
    with open(bigLogFile) as bigLog:
        processBigLogFile(bigLog)

Since I am loading these log files from a network drive, the largest part of the execution time is spent waiting for each file to load. However, the execution time of processBigLogFile is not trivial either.

So my basic idea was to make the process asynchronous, allowing the program to load the next log file while the current log is being processed. Seems simple enough, but I have no experience whatsoever with asynchronous programming and asyncio seems to offer a lot of different ways to achieve what I want to do (Using Task or Future seemed to be the likely candidates).

Can anybody show me the easiest way to achieve this? asyncio is not strictly necessary, but I would strongly prefer using a built-in library.

It should be noted that the log files have to be processed sequentially, so I can't simply parallelize loading and processing files.

Upvotes: 1

Views: 734

Answers (2)

quamrana

Reputation: 39374

It sounds like you want the file opening to be parallelised, but the processing to be sequential. I'm not sure whether that will save you any time.

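from concurrent.futures import ThreadPoolExecutor

bigLogFileFolder = [...]

num = len(bigLogFileFolder)

with ThreadPoolExecutor(num) as pool:
    # Open all files concurrently.
    futures = [pool.submit(open, bigLogFile) for bigLogFile in bigLogFileFolder]

    # Iterate in submission order (as_completed would yield in completion
    # order), so the logs are still processed sequentially.
    for future in futures:
        with future.result() as bigLog:
            processBigLogFile(bigLog)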
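
One caveat: open() only hands back a file handle and normally does not pull the contents across the network, so the slow part is likely the read itself. Below is a rough sketch, not a drop-in solution, that moves the read into the worker threads and still processes in order. read_file, process_all, process and max_workers=4 are names and values made up for this example; it also assumes every file fits in memory, because all reads are submitted up front.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    # Hypothetical helper: do the slow network read inside a worker thread.
    with open(path) as f:
        return f.read()


def process_all(paths, process, max_workers=4):
    # `process` stands in for the real per-file processing function.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order,
        # so processing stays sequential while later reads run in the background.
        for data in pool.map(read_file, paths):
            results.append(process(data))
    return results
```

Usage would be something like process_all(bigLogFileFolder, process_data), where process_data takes the file contents as a string.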

Upvotes: 0

freakish

Reputation: 56477

There's no need for complicated asynchronous code when the same can be achieved with a simple ThreadPoolExecutor:

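from concurrent.futures import ThreadPoolExecutor

# A single worker thread processes the data in the background
# while the main thread reads the next file.
with ThreadPoolExecutor(max_workers=1) as tp:
    for bigLogFile in bigLogFileFolder:
        with open(bigLogFile) as bigLog:
            data = bigLog.read()
        # The file is already closed here; only the contents are queued.
        tp.submit(process_data, data)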

Since ThreadPoolExecutor uses a queue under the hood, the order of processing is preserved as long as max_workers=1.
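
To illustrate the ordering claim, here is a minimal sketch. process_data here is a hypothetical stand-in that just records what it receives, and process_in_order is a name invented for the example. Keeping the futures and calling result() on them also surfaces exceptions raised in the worker, which otherwise go unnoticed when the return value of submit is thrown away.

```python
from concurrent.futures import ThreadPoolExecutor


def process_data(data, seen):
    # Hypothetical stand-in for the real processing step.
    seen.append(data)
    return len(data)


def process_in_order(chunks):
    seen = []
    with ThreadPoolExecutor(max_workers=1) as tp:
        # With one worker, tasks run in the order they were submitted.
        futures = [tp.submit(process_data, chunk, seen) for chunk in chunks]
        # result() re-raises any exception that occurred in the worker.
        sizes = [f.result() for f in futures]
    return seen, sizes


seen, sizes = process_in_order(["first", "second log", "third"])
print(seen)  # ['first', 'second log', 'third']
```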

This works fine if you have enough memory to hold all or most of the files at once. If you are memory-bound, you have to wait for the ThreadPoolExecutor to finish some tasks before reading more files.
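
One possible way to do that waiting is a semaphore that caps how many files are held in memory at once. This is only a sketch under assumptions: process_files and max_pending are made-up names, and process_data is whatever function handles the contents of one file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_files(paths, process_data, max_pending=2):
    # At most max_pending files are in memory (being read, queued or processed).
    slots = threading.BoundedSemaphore(max_pending)
    futures = []
    with ThreadPoolExecutor(max_workers=1) as tp:
        for path in paths:
            slots.acquire()  # blocks until the worker has finished an earlier file
            try:
                with open(path) as f:
                    data = f.read()
                future = tp.submit(process_data, data)
            except BaseException:
                slots.release()  # give the slot back if reading or submitting failed
                raise
            # Free the slot as soon as this file has been processed.
            future.add_done_callback(lambda _: slots.release())
            futures.append(future)
    # Results come back in submission order; result() re-raises worker errors.
    return [f.result() for f in futures]
```

With max_pending=2 the main thread reads one file ahead while the worker processes the current one, which is roughly the overlap the question asks for.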

Upvotes: 1
