edo101

Reputation: 639

Fastest way to read large (>5GB) log files with built-in functions and parallelization?

Being new to Python, I was tasked with finding the fastest way to parse large log files.

These are the methods I have tried so far, and they give me between 33 and 43 seconds of processing time. This one took the longest at 43 secs:

with open(filename) as tfile:
    tflines1 = tfile.readlines()

time_data_count = 0
for line in tflines1:
    if 'TIME_DATA' in line:
        time_data_count += 1
if time_data_count > 20:
    print("time_data complete")
else:
    print("incomplete time_data data")

This one took 34 secs on avg:

with open(filename) as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        memcaplist = memcap.split("\n")
        for line in memcaplist:
            if 'TIME_DATA' in line:
                time_data_count += 1
        if not memcap:
            break

This one averaged 36 seconds:

with open(filename, 'r', buffering=102400) as f:
    time_data_count = 0
    for line in f:
        if 'TIME_DATA' in line:
            time_data_count += 1

This one averaged 36 secs:

logfile = open(filename)
time_data_count = 0
for line in logfile:
    if 'TIME_DATA' in line:
        time_data_count += 1

The fastest was this one, which did the task in 26.8 seconds, and I don't know why it is the fastest. I don't get what makes it so special. With this one and others like it where I specified the byte count, I am worried that there might be a file or two where a line is split between byte chunks and the string I am looking for gets cut in half. That would result in an erroneous count. How would one address this:

with open(filename) as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        time_data_count += memcap.count('TIME_DATA')
        if not memcap:
            break
    if time_data_count > 20:
        print("time_data complete")
    else:
        print("incomplete time_data data")

Anyway, my boss told me to look into other ways that might make things faster. He suggested list comprehensions and reading the file object as binary. I don't think list comprehensions will help much with extracting the data I need from the file. I feel like reading the file as binary would add extra lines of code and issues where the code needs to know when to anticipate certain characters. Would reading the file as binary even make a difference? This isn't C++ where you can use pointers.
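One way to see whether binary mode makes a measurable difference on these files is to time both modes over the same chunked pass. A minimal sketch (reusing the 100 KiB chunk size from above and ignoring the chunk-boundary issue for now; time_count is just a throwaway helper name):

import time

def time_count(path, mode, target, chunksize=102400):
    # time a single chunked pass over the file, counting `target` in each chunk
    start = time.perf_counter()
    count = 0
    with open(path, mode) as f:
        while True:
            memcap = f.read(chunksize)
            if not memcap:
                break
            count += memcap.count(target)
    return count, time.perf_counter() - start

print(time_count(filename, 'r', 'TIME_DATA'))    # text mode: decodes every chunk
print(time_count(filename, 'rb', b'TIME_DATA'))  # binary mode: raw bytes, no decoding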

I briefly read up on parallelization, but I am not sure it would work for our use case. I need to track how many times certain strings show up, and I am not sure how splitting a file across different threads would work when you want to keep a running count. Is this even possible?

@TimPeters here is an edit of the final method using a binary file and seek():

tdatlength = len(b'TIME_DATA')  # length of the search string

with open(filename, 'rb') as f:
    time_data_count = 0
    while True:
        memcap = f.read(102400)
        time_data_count += memcap.count(b'TIME_DATA')
        if len(memcap) < 102400:
            break  # short read: end of file
        # rewind so the next chunk overlaps this one by tdatlength - 1 bytes;
        # a match split across the boundary then shows up whole in the next
        # chunk without being counted twice
        f.seek(-(tdatlength - 1), 1)
    if time_data_count > 20:
        print("time_data complete")
    else:
        print("incomplete time_data data")

Another method I tried:

with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    # ask tenzin about seek in this situation
    for line in f:
        if b'TIME_DATA' in line:
            time_data_count += 1
    f.seek(-tdatlength, 2)
    if time_data_count > 20:
        print("time_data complete")
    else:
        print("incomplete time_data data")
    print(time_data_count)

Upvotes: 1

Views: 1241

Answers (3)

Tim Peters

Reputation: 70602

Hints aren't working here, so here's the check-for-overlaps idea fleshed out in a concrete way that's equivalent to - but doesn't use - .seek().

target = b"TIME_DATA"
windowsize = len(target) - 1
last = b""          # previous chunk (empty before the first read)
target_count = 0
with open(filename, "rb") as f:
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        # paste the tail of the previous chunk onto the head of this one to
        # catch a match split across the chunk boundary
        overlap = last[-windowsize:] + memcap[:windowsize]
        if target in overlap:
            target_count += 1
        target_count += memcap.count(target)
        last = memcap

windowsize is one too small for target to match entirely within the portion of overlap taken from last, or within the portion taken from memcap, so if target is found in overlap it must match at least one character in each of the pieces: it really is an overlapping match. In the other direction, if there is a match across adjacent chunks, it must start at one of the last windowsize characters of last and end at one of the first windowsize characters of memcap, so windowsize is big enough to find any such match.
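One quick way to check that reasoning is to run the same overlap logic over a small in-memory buffer at every chunk size and compare against a whole-buffer count. A sanity-check sketch (the test data and the count_chunked helper are made up for illustration):

def count_chunked(data, target, chunksize):
    # same overlap scheme as the file-reading loop above, applied to an
    # in-memory bytes object split into fixed-size chunks
    windowsize = len(target) - 1
    last = b""
    total = 0
    for i in range(0, len(data), chunksize):
        memcap = data[i:i + chunksize]
        overlap = last[-windowsize:] + memcap[:windowsize]
        if target in overlap:
            total += 1
        total += memcap.count(target)
        last = memcap
    return total

target = b"TIME_DATA"
data = b"xxTIME_DATAxTIME_DATAxxxxTIME_DATA"
# any chunk size >= len(target) gives the same answer as a whole-buffer count
for chunksize in range(len(target), len(data) + 1):
    assert count_chunked(data, target, chunksize) == data.count(target)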

EDIT: repaired a too-strong claim below.

There's one area of ambiguity: if some prefix of target is also a suffix of target, matches can overlap. For example, "abab" has "ab" as both a prefix and a suffix. So if one chunk ends with, and the next chunk starts with, "abab", overlap will be "bababa". Do you, or do you not, want to count the "abab" in the middle of that? That is, does "abab" occur 2 or 3 times in "abababab"? The code above says 3.

But that ambiguity doesn't arise for targets - like "TIME_DATA" - where no prefix is equal to a suffix. It could arise for, e.g., "ATIME_DATA" (with "A" both a suffix and prefix): does that appear once or twice in "ATIME_DATATIME_DATA"? The code above can say "twice" if it's split across chunks like, e.g., "...ATIME_DATATI" and "ME_DATA...".
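For concreteness, here is that split reproduced with the same overlap window (the chunk contents are hypothetical, with dots standing in for surrounding text):

target = b"ATIME_DATA"
windowsize = len(target) - 1
left = b"...ATIME_DATATI"   # end of one chunk
right = b"ME_DATA..."       # start of the next chunk

count = left.count(target) + right.count(target)        # 1: the match inside `left`
if target in left[-windowsize:] + right[:windowsize]:   # the pasted-together overlap
    count += 1                                          # 2: the straddling match

print(count)                           # 2
print((left + right).count(target))    # 1, since bytes.count() never counts overlapping matches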

If you care about that, it could be addressed by doing brief searches to ensure that a match in the pasted-together segment doesn't overlap with a match near the end of the left chunk or near the start of the right chunk.

Upvotes: 1

jingx

Reputation: 4014

Here's a version using mmap to improve I/O performance. I can't benchmark it without your log files, but I believe it will be a lot faster than line-based I/O.

If you wish to parallelize it, you could expand on this, since the mmap object supports random access (see seek() and slicing). So you could spin up multiple threads (or processes) starting the search from several points in the file at the same time; a rough sketch of that idea follows the code below.

import mmap
import sys

with open(sys.argv[1], 'rb') as f:
    # map the whole file; prot=mmap.PROT_READ is POSIX-only
    # (access=mmap.ACCESS_READ is the cross-platform equivalent)
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    target = b'TIME_DATA'
    tl = len(target)
    idx = -tl
    counter = 0
    while True:
        # resume the search just past the previous match
        idx = mm.find(target, idx + tl)
        if idx < 0:
            break
        counter += 1
    print(counter)
    mm.close()
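As a rough sketch of the parallel idea mentioned above (not benchmarked; the worker count, helper names, and use of processes rather than threads are my assumptions): split the file into byte ranges, let each worker map the file and count the matches that start inside its own range, then sum the partial counts. Extending each range by len(target) - 1 bytes means a match that straddles a range boundary is still counted exactly once.

import mmap
import os
import sys
from concurrent.futures import ProcessPoolExecutor

TARGET = b'TIME_DATA'

def count_range(path, start, end):
    # each worker maps the file itself (mmap objects can't be shared between
    # processes) and counts the matches that start inside [start, end)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # extend the slice by len(TARGET) - 1 bytes so a match straddling
            # `end` is still seen here, while one starting at or after `end`
            # is left for the next worker
            return mm[start:min(end + len(TARGET) - 1, len(mm))].count(TARGET)
        finally:
            mm.close()

def parallel_count(path, workers=4):
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))   # ceiling division into byte ranges
    ranges = [(i, min(i + step, size)) for i in range(0, size, step)]
    starts = [r[0] for r in ranges]
    ends = [r[1] for r in ranges]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(count_range, [path] * len(ranges), starts, ends))

if __name__ == '__main__':
    print(parallel_count(sys.argv[1]))

Whether this beats the single-process scan depends on how much of the time is spent on I/O versus scanning; plain threads may be limited by the GIL for the scanning part, which is why this sketch uses processes.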

Upvotes: 1

Jay Mody

Reputation: 4033

You could parse multiple log files in parallel using python's multiprocessing library:

import glob
import multiprocessing

def process_log(filename):
    with open(filename) as f:
        time_data_count = 0
        while True:
            memcap = f.read(102400)
            time_data_count += memcap.count('TIME_DATA')
            if not memcap:
                break
        if time_data_count > 20:
            print("time_data complete")
            return True
        else:
            print("incomplete time_data data")
            return False

# load all the filepaths, e.g. with glob
filepaths = glob.glob("path/to/logdir/*.log")

num_cores_to_use = multiprocessing.cpu_count()  # or however many cores you want to use
pool = multiprocessing.Pool(num_cores_to_use)

number_of_complete_logs = 0
for complete in pool.imap_unordered(process_log, filepaths):
    if complete:
        number_of_complete_logs += 1

print(number_of_complete_logs)

This way, if you have 4 cores, you can process 4 log files in the time it takes to process one. Plus, each file is worked on separately, so each file's TIME_DATA count stays intact.

If you're processing a lot of logfiles, I recommend using tqdm:

from tqdm import tqdm

for complete in tqdm(pool.imap_unordered(process_log, filepaths), total=len(filepaths)):

This way you can track the progress and estimate how long the entire operation will take.

Upvotes: 2
