iJade

Reputation: 23801

Reading files using different threads

I'm trying to read and modify each row of a number of files using Python. Each file has thousands to hundreds of thousands of rows, so each file is processed only after the previous one is finished. I'm trying to read the files like:

csvReader = csv.reader(open("file", "r"))
for row in csvReader:
    handleRow(row)

I want to use multithreading to read the files in parallel, each in a separate thread, in order to save time. Can anyone point out whether this would be useful, and how to implement it?

Upvotes: 2

Views: 1507

Answers (2)

dbra

Reputation: 631

Your bottleneck is probably the I/O, so multithreading would not help; anyway, it's easy to try. The following code processes all the files in the current directory, one thread per file, applying a given string function to each row and writing the new file to a given path.

from threading import Thread
from os import listdir
from os.path import basename, join, isfile

class FileChanger(Thread):
    def __init__(self, sourcefilename, rowfunc, tgpath):
        Thread.__init__(self)
        self.rowfunc = rowfunc
        self.sfname = sourcefilename
        self.tgpath = tgpath

    def run(self):
        # write the transformed rows to a same-named file under tgpath
        with open(join(self.tgpath, basename(self.sfname)), 'w') as tgf:
            for r in open(self.sfname):
                tgf.write(self.rowfunc(r))

# main #
workers = [FileChanger(f, str.upper, '/tmp/tg')
           for f in listdir('.') if isfile(f)]
for w in workers:
    w.start()
for w in workers:
    w.join()
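
If the rows need real CSV parsing (as in the question) rather than raw line handling, run can be adapted; a minimal sketch, assuming rowfunc now takes and returns a list of fields:

import csv
from os.path import basename, join

def run(self):
    # parse each row with the csv module and write the transformed row back out
    with open(self.sfname, newline='') as src, \
         open(join(self.tgpath, basename(self.sfname)), 'w', newline='') as tgt:
        writer = csv.writer(tgt)
        for row in csv.reader(src):
            writer.writerow(self.rowfunc(row))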

Upvotes: 1

abarnert

Reputation: 365717

It may or may not be useful: if all the files are on the same drive and you're already pushing the drive as fast as it can go, multiplexing can only slow things down. But if you're not maxing out your I/O, it'll speed things up.

As far as how to do it, that's trivial: wrap your code up in a function that takes a pathname, then use a concurrent.futures.ThreadPoolExecutor or a multiprocessing.dummy.Pool, and it's one line of code to map your function over your whole iterable of pathnames:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(4) as executor:  # up to 4 files in flight at once
    executor.map(func, paths)
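
The multiprocessing.dummy.Pool version looks almost identical; a minimal sketch, assuming func is your per-file function and paths is your iterable of pathnames:

from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API

with Pool(4) as pool:  # 4 worker threads
    pool.map(func, paths)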

One more thing: if the reason you can't max out the I/O is that you're doing too much CPU work on each line, threads won't help in Python (because of the GIL), but you can just use processes--the exact same code, but with ProcessPoolExecutor.
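
The process-based version is the same one-liner; a minimal sketch, again assuming func and paths from above (note that with processes, func must be picklable, i.e. defined at module level):

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(4) as executor:  # 4 worker processes sidestep the GIL
    executor.map(func, paths)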

Upvotes: 3
