Reputation: 639
Being new to Python, I was tasked with finding the fastest way to parse large log files in Python.
These are the methods I have tried so far, and they give me between 33 and 43 seconds of processing time. This one took the longest, at 43 secs:
tflines1 = tfile.readlines()
time_data_count = 0
for line in tflines1:
    if 'TIME_DATA' in line:
        time_data_count += 1
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")
This one took 34 secs on avg:
with open(filename) as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        memcaplist = memcap.split("\n")
        for line in memcaplist:
            if 'TIME_DATA' in line:
                time_data_count += 1
        if not memcap:
            break
This one averaged 36 seconds:
with open(filename, 'r', buffering=102400) as f:
    time_data_count = 0
    for line in f:
        if 'TIME_DATA' in line:
            time_data_count += 1
This one averaged 36 secs:
logfile = open(filename)
time_data_count = 0
for line in logfile:
    if 'TIME_DATA' in line:
        time_data_count += 1
The fastest was this one, which did the task in 26.8 seconds, and I don't know why it is the fastest. I don't get what makes it so special. With this one and others like it where I specified the number of bytes to read, I am worried that there might be a file or two where a line is split between byte chunks and the string I am looking for gets split in half. This would result in an erroneous count. How would one address this:
with open(filename) as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        time_data_count += memcap.count('TIME_DATA')
        if not memcap:
            break
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")
Anyway, the boss told me to look into other ways that might make things faster. He suggested list comprehensions and reading the file object as binary. I don't think a list comprehension will help much, and I still need to be able to extract the data I need from the file. I feel like reading the file as binary would add extra lines of code and issues where the code needs to know when to anticipate certain characters. Would reading the file as binary even make a difference? This isn't C++, where you can use pointers.
I briefly read up on parallelization, but I am not sure it would work for our use case. I need to track how many times certain strings show up, so I am not sure how splitting a file across different threads would work when you want to keep a running count. Is this even possible?
@TimPeters here is an edit of the final method using a binary file and seek():
with open(filename, 'rb') as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        f.seek(-tdatlength, 1)
        time_data_count += memcap.count(b'TIME_DATA')
        if not memcap:
            break
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")
Another method I tried:
with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    # ask tenzin about seek in this situation
    for line in f:
        if b'TIME_DATA' in line:
            time_data_count += 1
        f.seek(-tdatlength, 2)
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")
print(time_data_count)
Upvotes: 1
Views: 1241
Reputation: 70602
Hints aren't working here, so fleshing out the check-for-overlaps idea in a concrete way equivalent to - but not using - .seek():
target = b"TIME_DATA"
windowsize = len(target) - 1
last = b""
target_count = 0
with open(filename, "rb") as f:
while True:
memcap = f.read(102400)
if not memcap:
break
overlap = last[-windowsize :] + memcap[: windowsize]
if target in overlap:
target_count += 1
target_count += memcap.count(target)
last = memcap
windowsize is one too small for target to match in the portion of overlap taken from last, or the one from memcap, so if target is found in overlap it must match at least one character in each of the pieces: it really is an overlapping match. In the other direction, if there is a match across adjacent chunks, it must start at one of the last windowsize characters of last and end at one of the first windowsize characters of memcap, so windowsize is big enough to find any such match.
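As a quick sanity check of that argument, here is a tiny, made-up example of the target being split by a read boundary; neither chunk contains it on its own, but the pasted-together window does:
target = b"TIME_DATA"
windowsize = len(target) - 1       # 8

last = b"...some log text TIME_"   # chunk that ends mid-target
memcap = b"DATA more log text..."  # next chunk with the rest of it

print(last.count(target), memcap.count(target))   # 0 0
overlap = last[-windowsize:] + memcap[:windowsize]
print(overlap)            # b'xt TIME_DATA mor'
print(target in overlap)  # True: the split occurrence is detected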
EDIT: repaired a too-strong claim below.
There's one area of ambiguity: if some prefix of target is also a suffix of target, matches can overlap. For example, "abab" has "ab" as both a prefix and a suffix. So if one chunk ends with, and the next chunk starts with, "abab", overlap will be "bababa". Do you, or do you not, want to count the "abab" in the middle of that? That is, does "abab" occur 2 or 3 times in "abababab"? The code above says 3.
But that ambiguity doesn't arise for targets - like "TIME_DATA" - where no prefix is equal to a suffix. It could arise for, e.g., "ATIME_DATA" (with "A" both a suffix and prefix): does that appear once or twice in "ATIME_DATATIME_DATA"? The code above can say "twice" if it's split across chunks like, e.g., "...ATIME_DATATI" and "ME_DATA...".
If you care about that, it could be addressed by doing brief searches to ensure that a match in the pasted-together segment doesn't overlap with a match near the end of the left chunk or near the start of the right chunk.
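For what it's worth, that ambiguity is easy to reproduce by wrapping the counting loop above in a helper (chunked_count is an invented name here, and the chunk contents are made up):
def chunked_count(chunks, target):
    # Same logic as the loop above, just fed from a list instead of f.read().
    windowsize = len(target) - 1
    last = b""
    total = 0
    for memcap in chunks:
        overlap = last[-windowsize:] + memcap[:windowsize]
        if target in overlap:
            total += 1
        total += memcap.count(target)
        last = memcap
    return total

target = b"ATIME_DATA"
data = b"...ATIME_DATATIME_DATA..."

print(data.count(target))                                           # 1
print(chunked_count([data], target))                                # 1 (no boundary hit)
print(chunked_count([b"...ATIME_DATATI", b"ME_DATA..."], target))   # 2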
Upvotes: 1
Reputation: 4014
Here's a version using mmap to improve I/O performance. I can't benchmark it without your log files, but I believe it will be a lot faster than line-based I/O.
If you wish to parallelize it, you could expand on this as the mmap object supports random access (see seek()). So you could spin up multiple threads starting the search from several points in the file at the same time.
import mmap
import sys

with open(sys.argv[1], 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    target = b'TIME_DATA'
    tl = len(target)
    idx = -tl
    counter = 0
    while True:
        idx = mm.find(target, idx + tl)
        if idx < 0:
            break
        counter += 1
    print(counter)
    mm.close()
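To make that parallel suggestion concrete, here is a rough, unbenchmarked sketch (count_range, parallel_count, the 4-worker default, and the range-splitting scheme are all invented for the example) that gives each worker its own byte range of the same file and sums the per-range counts. It uses multiprocessing rather than threads so the question of the GIL doesn't arise. Each range is searched slightly past its end so a match straddling a boundary is still seen, but only matches that start inside the range are credited to it, so for a target like TIME_DATA nothing is counted twice:
import mmap
import multiprocessing
import os
import sys

TARGET = b'TIME_DATA'

def count_range(args):
    path, start, end = args
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            # Let a match *start* anywhere in [start, end) even if it runs
            # past end; the next range will not count it again because it
            # only credits matches starting at or after its own start.
            limit = min(len(mm), end + len(TARGET) - 1)
            idx = start
            while True:
                idx = mm.find(TARGET, idx, limit)
                if idx < 0 or idx >= end:
                    break
                count += 1
                idx += len(TARGET)
            return count
        finally:
            mm.close()

def parallel_count(path, workers=4):
    size = os.path.getsize(path)
    step = size // workers + 1
    ranges = [(path, lo, min(lo + step, size)) for lo in range(0, size, step)]
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(count_range, ranges))

if __name__ == '__main__':
    print(parallel_count(sys.argv[1]))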
Upvotes: 1
Reputation: 4033
You could parse multiple log files in parallel using Python's multiprocessing library:
import glob
import multiprocessing

def process_log(filename):
    with open(filename) as f:
        time_data_count = 0
        while True:
            memcap = f.read(102400)
            time_data_count += memcap.count('TIME_DATA')
            if not memcap:
                break
    if time_data_count > 20:
        print("time_data complete")
        return True
    else:
        print("incomplete time_data data")
        return False

# load all the filepaths with something like this:
filepaths = glob.glob("path/to/logdir/*.log")
num_cores_to_use = multiprocessing.cpu_count()  # set num cores to use
pool = multiprocessing.Pool(num_cores_to_use)
number_of_complete_logs = 0
for complete in pool.imap_unordered(process_log, filepaths):
    if complete:
        number_of_complete_logs += 1
print(number_of_complete_logs)
This way, if you have 4 cores, you would process 4 log files in the time of 1. Plus, each file is worked on separately, so the TIME_DATA counter stays intact.
If you're processing a lot of logfiles, I recommend using tqdm:
for complete in tqdm(pool.imap_unordered(process_log, filepaths), total=len(filepaths)):
This way you can track the progress and estimate how long the entire operation will take.
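For completeness, here is roughly how that line slots into the script above (a sketch that assumes tqdm is installed; process_log, filepaths, and num_cores_to_use are the names from the earlier snippet):
from tqdm import tqdm

# process_log, filepaths and num_cores_to_use are defined as in the snippet
# above; tqdm just wraps the iterator to draw a progress bar.
with multiprocessing.Pool(num_cores_to_use) as pool:
    number_of_complete_logs = 0
    for complete in tqdm(pool.imap_unordered(process_log, filepaths),
                         total=len(filepaths)):
        if complete:
            number_of_complete_logs += 1
print(number_of_complete_logs)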
Upvotes: 2