Reputation: 10469
I have a huge number of report files (about 650 files) which takes about 320 M of hard disk and I want to process them. There are a lot of entries in each file; I should count and log them based on their content. Some of them are related to each other and I should find, log and count them too; matches may be in different files. I have wrote a simple script to do the job. I used python profiler and it just took about 0.3 seconds to run the script for one single file with 2000 lines that we need half of them for processing. But for the whole directory it took 1 hour and a half to be done. This is how my script looks like:
# imports
class Parser(object):
def __init__(self):
# load some configurations
# open some log files
# set some initial values for some variables
def parse_packet(self, tags):
# extract some values from line
def found_matched(self, packet):
# search in the related list to find matched line
def save_packet(self, packet):
# write the line in the appropriate files and increase or decrease some counters
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
file_list = dname[2]
for fname in file_list:
self.parse(os.path.join(dname[0], fname))
self.finalize_process()
def finalize_process(self):
# closing files
I want to decrease the time at least to the 10% percent of current execution time. Maybe multiprocessing
can help me or just some enhancement in current script will do the task. Anyway could you please help me in this?
Edit 1:
I have changed my code according to @Reut Sharabani's answer:
def parse(self, file_addr):
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
def process_files(self):
if not os.path.isdir(self.src_dir):
self.log('No such file or directory: ' + str(self.src_dir))
sys.exit(1)
input_dirs = os.walk(self.src_dir)
for dname in input_dirs:
process_pool = multiprocessing.Pool(10)
for fname in file_list:
file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
process_pool.map(self.parse, file_list)
self.finalize_process()
I also added below lines before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
__builtin__.instancemethod failed
:
import copy_reg
import types
def _pickle_method(m):
if m.im_self is None:
return getattr, (m.im_class, m.im_func.func_name)
else:
return getattr, (m.im_self, m.im_func.func_name)
copy_reg.pickle(types.MethodType, _pickle_method)
Another thing that I have done into my code was not to keep open log files during file processing; I open and close them for writing each entry just to avoid ValueError: I/O operation on closed file
.
Now the problem is that I have some files which are being processed multiple times. I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10)
before the for loop? Consider that I have just one directory right now and it doesn't seem to be the problem.
EDIT 2:
I also tried using ThreadPoolExecutor
this way:
with ThreadPoolExecutor(max_workers=10) as executor:
for fname in file_list:
executor.submit(self.parse, fname)
Results were correct, but it took an hour and a half to be completed.
Upvotes: 3
Views: 4688
Reputation: 5855
Aside from using parallel processing, your parse
method is rather inefficient as @Jan-PhilipGehrcke already pointed out. To expand on his recommendation: The classical variant:
def parse(self, file_addr):
with open(file_addr, 'r') as f:
line_no = 0
for line in f:
line_no += 1
if line_no % 2 != 0:
packet = parse_packet(line)
if found_matched(packet):
# count
self.save_packet(packet)
Or using your style (assuming you use python 3):
def parse(self, file_addr):
with open(file_addr, 'r') as f:
filtered = (l for index,l in enumerate(f) if index % 2 != 0)
for line in filtered:
# and so on
The thing to notice here, is the use of iterators, all operations to build the filtered list (which is not actually a list!!) operate on and return iterators, which means that at no point the entire file is loaded into a list.
Upvotes: 1
Reputation: 35826
First of all, "about 650 files which takes about 320 M" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system probably is not your bottleneck (also supported by "it just took about 0.3 seconds to run the script for one single file with 2000 lines", which clearly indicates CPU-limitation). However, the exact way you are reading files from within Python may not be efficient.
Furthermore, a simple multiprocessing
-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).
Just have a look at multiprocessing
, your architecture likely will involve one manager process (the parent), which defines a task Queue
, and a Pool
of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.
This is quite inefficient:
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
...
readlines()
reads the entire file before the list comprehension is evaluated. Only after that you again iterate through all lines. Hence, you iterate three times through your data. Combine everything into a single loop, so that you iterate the lines only once.
Upvotes: 3
Reputation: 31339
You should be using threads here. If you're blocked by cpu later, you can use processes.
To explain I first created a ten thousand files (0.txt
... 9999.txt
), with a line count that's equivalent to the name (+1), using this command:
for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done
Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
import time
import sys
print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))
files = ["%d.txt" % i for i in range(1000)]
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f.readlines():
if int(line.strip()) % 2 == 0:
line_count += 1
print "finished file %s" % filename
return line_count
start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
As you can see this takes no time, and the results are correct. 10 files are processed in parallel and the cpu is fast enough to handle the results. If you want even more you may consider using threads and processes to utilize all cpus as well as not letting IO block you.
As comments suggest, I was wrong and this is not I/O blocked, so you can speed it up using multiprocessing (cpu blocked). Because I used a ThreadPool which has the same interface as Pool you can make minimal edits and have the same code running:
#!/usr/bin/env python
import multiprocessing
import time
import sys
files = ["%d.txt" % i for i in range(2000)]
# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
with open(filename, 'r') as f:
# do some processing
line_count = 0
for line in f:
if int(line.strip()) % 2 == 0:
line_count += 1
return line_count
print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))
start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
Results:
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043
Upvotes: 1