Reputation: 942
I am currently pulling .txt files from the paths listed in FileNameList, which works. But my main problem is that it is too slow when there are many files.
I am using this code to print the list of .txt files:
import os
import sys

# FileNameList is my set of files from my path
for filefolder in FileNameList:
    for file in os.listdir(filefolder):
        if "txt" in file:
            filename = filefolder + "\\" + file
            print(filename)
Any help or suggestions on using threads or multiprocessing to make the reading faster would be appreciated. Thanks in advance.
Upvotes: 7
Views: 9895
Reputation: 43495
So you mean there is no way to speed this up? Because my scenario is to read a bunch of files, then read each of their lines and store them in the database.
The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.
The second rule is that before you do anything else, you measure where the problem lies:
Write a simple program that sequentially reads files, splits them into lines and stuffs those in a database. Run that program under a profiler to see where the program is spending most of its time.
Only then do you know which part of the program needs speeding up.
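For example, a minimal sketch of that measuring step (the store_line_in_db stub and the run_sequential driver are placeholders of my own, not part of the answer; FileNameList is the question's list of folders):

import cProfile
import pstats
import os

def store_line_in_db(line):
    pass  # placeholder for the real database insert

def run_sequential(folders):
    # Sequentially read every .txt file, split it into lines, store each line.
    for folder in folders:
        for name in os.listdir(folder):
            if name.endswith(".txt"):
                with open(os.path.join(folder, name)) as f:
                    for line in f:
                        store_line_in_db(line)

# FileNameList is the question's list of folder paths
cProfile.run("run_sequential(FileNameList)", "read.prof")
pstats.Stats("read.prof").sort_stats("cumulative").print_stats(10)

The top of that cumulative-time report tells you whether the time goes into reading the files, splitting them, or the database calls.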
Here are some pointers nevertheless.
- Reading the files with mmap.
- Using a multiprocessing.Pool to spread out the reading of multiple files over different cores (a sketch follows below). But then the data from those files will end up in different processes and would have to be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
Upvotes: 5
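A minimal sketch of that multiprocessing.Pool pointer (the read_txt_lines helper and the use of imap_unordered are my own choices, not part of the answer; FileNameList is the question's list of folders):

import os
from multiprocessing import Pool

def read_txt_lines(path):
    # Runs in a worker process: read one file and return its lines.
    with open(path) as f:
        return f.readlines()

if __name__ == "__main__":
    # Collect the .txt paths the same way the question does.
    paths = [os.path.join(folder, name)
             for folder in FileNameList
             for name in os.listdir(folder)
             if name.endswith(".txt")]

    with Pool() as pool:
        # Each file's lines come back to the parent over IPC.
        for lines in pool.imap_unordered(read_txt_lines, paths):
            pass  # store the lines in the database here

Whether this helps at all depends on whether the disk or the Python-side processing is the bottleneck, which is what the profiling step above is meant to reveal.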
Reputation: 11
In this case you can try to use multithreading. But keep in mind that, because of the Python GIL (Global Interpreter Lock), only one thread executes Python bytecode at a time. If you spread the work over multiple processes (or machines) it can be faster. You can use something like a producer/worker pattern:
Look at the queues and pipes in multiprocessing (truly separate subprocesses) to sidestep the GIL; a small sketch follows below.
With these two communication objects you can build some nice blocking or non-blocking programs.
Side note: keep in mind not every db connection is thread-safe.
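A rough sketch of that producer/worker idea with a multiprocessing.Queue (the worker count, the None sentinel and the store-in-DB placeholder are my own choices; FileNameList is the question's list of folders):

import os
from multiprocessing import Process, Queue

def worker(queue):
    # Consume file paths until the producer sends the None sentinel.
    while True:
        path = queue.get()
        if path is None:
            break
        with open(path) as f:
            lines = f.readlines()
        # store `lines` in the database here, using one connection per process

if __name__ == "__main__":
    queue = Queue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()

    # Producer: walk the folders and feed the .txt paths to the workers.
    for folder in FileNameList:
        for name in os.listdir(folder):
            if name.endswith(".txt"):
                queue.put(os.path.join(folder, name))

    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()

Giving each worker process its own database connection also sidesteps the thread-safety issue from the side note.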
Upvotes: 1
Reputation: 35217
You can get some speed-up, depending on the number and size of your files. See this answer to a similar question: Efficient file reading in python with need to split on '\n'
Essentially, you can read multiple files in parallel with multithreading, multiprocessing, or otherwise (e.g. an iterator)… and you may get some speedup. The easiest thing to do is to use a library like pathos (yes, I'm the author), which provides multiprocessing, multithreading, and other options in a single common API -- basically, so you can code it once, and then switch between different backends until you have what works the fastest for your case.
There are a lot of options for different types of maps (on the pool object), as you can see here: Python multiprocessing - tracking the process of pool.map operation.
While the following isn't the most imaginative of examples, it shows a doubly-nested map (equivalent to a doubly-nested for loop), and how easy it is to change the backends and other options on it.
>>> import pathos
>>> p = pathos.pools.ProcessPool()
>>> t = pathos.pools.ThreadPool()
>>> s = pathos.pools.SerialPool()
>>>
>>> f = lambda x,y: x+y
>>> # two blocking maps, threads and processes
>>> t.map(p.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # two blocking maps, threads and serial (i.e. python's map)
>>> t.map(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>> # an unordered iterative and a blocking map, threads and serial
>>> t.uimap(s.map, [f]*5, [range(i,i+5) for i in range(5)], [range(i,i+5) for i in range(5)])
<multiprocess.pool.IMapUnorderedIterator object at 0x103dcaf50>
>>> list(_)
[[0, 2, 4, 6, 8], [2, 4, 6, 8, 10], [4, 6, 8, 10, 12], [6, 8, 10, 12, 14], [8, 10, 12, 14, 16]]
>>>
I have found that generally, unordered iterative maps (uimap) are the fastest, but then you have to not care which order something is processed, as it might get out of order on the return. As far as speed… surround the above with a call to time.time or similar.
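For instance, a minimal timing sketch around the maps above (reusing the toy f from the transcript; the printed labels are mine):

import time
import pathos

p = pathos.pools.ProcessPool()
t = pathos.pools.ThreadPool()
f = lambda x, y: x + y
data = [range(i, i + 5) for i in range(5)]

start = time.time()
t.map(p.map, [f] * 5, data, data)          # blocking: threads dispatching process maps
print("blocking map:        %.3fs" % (time.time() - start))

start = time.time()
list(t.uimap(p.map, [f] * 5, data, data))  # unordered iterative variant
print("unordered iterative: %.3fs" % (time.time() - start))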
Get pathos here: https://github.com/uqfoundation
Upvotes: 2
Reputation: 12022
Multithreading or multiprocessing is not going to speed this up; your bottleneck is the storage device.
Upvotes: 4