Syntax Rommel

Reputation: 942

Reading multiple files using threads/multiprocessing

I am currently pulling .txt files from the list of paths in FileNameList, which works. My main problem is that it is too slow when there are many files.

I am using this code to print the list of txt files:

import os

# FileNameList is my list of folder paths
for filefolder in FileNameList:
    for name in os.listdir(filefolder):
        if name.endswith(".txt"):
            filename = os.path.join(filefolder, name)
            print(filename)

Any help or suggestion on using threads/multiprocessing to make the reading faster would be appreciated. Thanks in advance.

Upvotes: 7

Views: 9895

Answers (4)

Roland Smith

Reputation: 43495

"So you mean there is no way to speed this up? Because my scenario is to read a bunch of files, then read each line of them and store them in the database."

The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, measure where the problem lies:

Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database. Run that program under a profiler to see where the program is spending most of its time.

Only then do you know which part of the program needs speeding up.
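
For example, here is a rough sketch of that measurement step using cProfile from the standard library; load_file_into_db() and the file paths are hypothetical placeholders for your own code:

import cProfile
import pstats

def load_file_into_db(lines):
    # Hypothetical placeholder for your own "store each line in the database" step.
    pass

def ingest_all(filepaths):
    # Sequentially read each file, split it into lines and hand them to the database step.
    for path in filepaths:
        with open(path) as f:
            load_file_into_db(f.read().splitlines())

if __name__ == "__main__":
    filepaths = ["a.txt", "b.txt"]   # replace with your own .txt paths
    profiler = cProfile.Profile()
    profiler.enable()
    ingest_all(filepaths)
    profiler.disable()
    # Show the 10 entries with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)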


Here are some pointers nevertheless.

  • Speeding up the reading of files can be done using mmap.
  • You could use multiprocessing.Pool to spread out the reading of multiple files over different cores. But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data (see the sketch after this list).
  • In the CPython implementation of Python, only one thread at a time can be executing Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable if threads would offer improvement.
  • Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database. Is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera.
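
A rough sketch of the multiprocessing.Pool idea (the folder name and worker function are made up for illustration); note that the lists of lines returned by the workers are pickled and shipped back to the parent, which is exactly the IPC overhead mentioned above:

import glob
import os
from multiprocessing import Pool

def read_lines(path):
    # Runs in a worker process; the returned list travels back to the parent via IPC.
    with open(path) as f:
        return f.read().splitlines()

if __name__ == "__main__":
    # Stand-in for your FileNameList scan: all .txt files under one folder.
    txt_files = glob.glob(os.path.join("some_folder", "*.txt"))
    pool = Pool()                     # one worker per CPU core by default
    try:
        results = pool.map(read_lines, txt_files)
    finally:
        pool.close()
        pool.join()
    for path, lines in zip(txt_files, results):
        print(path, len(lines))       # here you would stuff the lines into the database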

Upvotes: 5

Uwe

Reputation: 11

In this case you can try to use multithreading. But keep in mind that, due to the Python GIL (Global Interpreter Lock), only one thread at a time can execute Python bytecode, so non-atomic operations do not really run in parallel. If you spread the work over multiple processes or machines, it may well be faster. You can use something like a producer/worker pattern:

  • Producer (one thread) holds the file list and a queue
  • Workers (more than one thread) collect file information from the queue and push the content to the database

Look at the queues and pipes in multiprocessing (real separate subprocesses) to sidestep the GIL.

With these two communication objects you can build some nice blocking or non-blocking programs; a rough sketch follows.
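
A minimal producer/worker sketch with multiprocessing.Queue; the file paths and the commented-out store_in_db() call are hypothetical placeholders for your own code:

import multiprocessing

def worker(queue):
    # Each worker pulls a path from the queue, reads the file and would then
    # push its lines to the database (store_in_db is a placeholder for your code).
    while True:
        path = queue.get()
        if path is None:              # sentinel: no more work
            break
        with open(path) as f:
            lines = f.read().splitlines()
        # store_in_db(lines)
        print(path, len(lines))

if __name__ == "__main__":
    file_list = ["a.txt", "b.txt", "c.txt"]   # replace with your own paths
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(4)]
    for p in workers:
        p.start()
    for path in file_list:                    # producer side: fill the queue
        queue.put(path)
    for _ in workers:                         # one sentinel per worker
        queue.put(None)
    for p in workers:
        p.join()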

Side note: keep in mind not every db connection is thread-safe.

Upvotes: 1

Mike McKerns

Reputation: 35217

You can get some speed-up, depending on the number and size of your files. See this answer to a similar question: Efficient file reading in python with need to split on '\n'

Essentially, you can read multiple files in parallel with multithreading, multiprocessing, or otherwise (e.g. an iterator)… and you may get some speedup. The easiest thing to do is to use a library like pathos (yes, I'm the author), which provides multiprocessing, multithreading, and other options in a single common API -- basically, so you can code it once, and then switch between different backends until you have what works the fastest for your case.

There are a lot of options for different types of maps (on the pool object), as you can see here: Python multiprocessing - tracking the process of pool.map operation.

While the following isn't the most imaginative of examples, it shows a doubly-nested map (equivalent to a doubly-nested for loop), and how easy it is to change the backends and other options on it.

>>> import pathos
>>> p = pathos.pools.ProcessPool()
>>> t = pathos.pools.ThreadPool()
>>> s = pathos.pools.SerialPool()
>>> 
>>> f = lambda x,y: x+y
>>> # two blocking maps, threads and processes
>>> t.map(p.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # two blocking maps, threads and serial (i.e. python's map)
>>> t.map(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # an unordered iterative and a blocking map, threads and serial
>>> t.uimap(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
<multiprocess.pool.IMapUnorderedIterator object at 0x103dcaf50>
>>> list(_)
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> 

I have found that, generally, unordered iterative maps (uimap) are the fastest, but then you must not care about the order in which items are processed, as the results may come back out of order. As far as speed… surround the above with a call to time.time or similar.
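
For example, a tiny timing sketch, assuming the pools t and s and the function f from the session above:

import time

start = time.time()
results = list(t.uimap(s.map, [f]*5,
                       [range(i, i+5) for i in range(5)],
                       [range(i, i+5) for i in range(5)]))
print(time.time() - start)    # elapsed wall-clock seconds for this map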

Get pathos here: https://github.com/uqfoundation

Upvotes: 2

Cyphase

Reputation: 12022

Multithreading or multiprocessing is not going to speed this up; your bottleneck is the storage device.

Upvotes: 4
