realp

Reputation: 627

Efficiently parsing text file for datetimes

I have text files that look something like

<Jun/11 09:14 pm>Information i need to capture1
<Jun/11 09:14 pm> Information i need to capture2

<Jun/11 09:14 pm> Information i need to capture3
<Jun/11 09:14 pm> Information i need to capture4
<Jun/11 09:15 pm> Information i need to capture5
<Jun/11 09:15 pm> Information i need to capture6

and two datetimes like

15/6/2015-16:27:10  # startDateTime
15/6/2015-17:27:19  # endDateTime

I need to grab all the information in the logs between the two datetimes. Currently I make a datetime object from each of the two times I'm searching between.

I then read the file line by line and build a new datetime object for each line, which I compare against my start and end times to decide whether I should grab that line. However, the files are huge (150 MB) and the code can take hours to run (on 100+ files).

The code looks something like

f = open(fileToParse, "r")
for line in f.read().splitlines():
    if line.strip() == "":
        continue
    lineDateTime = datetime.datetime(lineYear, lineMonth, lineDay, lineHour, lineMin, lineSec)
    if (startDateTime < lineDateTime < endDateTime):
        writeFile.write(line+"\n")
        between = True
    elif(lineDateTime > endDateTime):
        writeFile.write(line+"\n")
        break
    else:
        if between:
            writeFile.write(line+"\n")

I want to rewrite this in a smarter way. The files can hold months of information, but I usually only search for about 1 hour to 3 days of data.

Upvotes: 2

Views: 169

Answers (1)

Padraic Cunningham

Reputation: 180461

You are reading the whole file into memory regardless. Instead, just iterate over the file object and break when the date is beyond your upper limit:

with open(fileToParse, "r") as f:
    for line in f:
        if not line.strip():
            continue
        # lineYear, lineMonth, etc. have to be parsed out of the line first
        lineDateTime = datetime.datetime(lineYear, lineMonth, lineDay, lineHour, lineMin, lineSec)
        if startDateTime < lineDateTime < endDateTime:
            writeFile.write(line)  # line already ends with "\n"
        elif lineDateTime > endDateTime:
            break

Obviously you still need to get lineYear, lineMonth, etc. from each line.
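As a rough sketch of that missing step, the stamp could be pulled out with strptime. This assumes every log line starts with a stamp shaped like `<Jun/11 09:14 pm>`, that `Jun/11` means month/day, and that the year (which the stamp does not carry) is supplied by the caller. `parse_line_datetime` is a hypothetical helper name, not something from the question, and `%b`/`%p` depend on an English locale.

```python
import datetime

def parse_line_datetime(line, year):
    """Return the datetime of a leading '<Jun/11 09:14 pm>' stamp, or None."""
    if not line.startswith("<"):
        return None
    close = line.find(">")
    if close == -1:
        return None
    stamp = line[1:close]
    try:
        # The year is an assumption passed in by the caller; putting it in the
        # string keeps Feb/29 parseable in leap years.
        return datetime.datetime.strptime(f"{year} {stamp}", "%Y %b/%d %I:%M %p")
    except ValueError:
        return None
```

Returning None for blank or malformed lines lets the calling loop skip them with a single check instead of a try/except around every line.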

Using f.read().splitlines() reads the entire file into memory, so even if you pass the upper limit five lines in, the whole file has already been loaded. On top of that, splitting creates a full list of all the lines as well.
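A small illustration of the difference, using io.StringIO as a stand-in for a real log file (the sample data and the `take_until` helper are made up for this sketch). Iterating over the file object hands out one line at a time, so breaking early leaves the remainder unread.

```python
import io

def take_until(f, stop_prefix):
    """Collect lines until one starts with stop_prefix; leave the rest unread."""
    kept = []
    for line in f:  # pulls one line at a time from the file object
        if line.startswith(stop_prefix):
            break
        kept.append(line)
    return kept

log = io.StringIO("a1\na2\nz9\nz10\n")
kept = take_until(log, "z")
rest = log.read()  # whatever the loop never touched
```

Note that each line produced by the iteration keeps its trailing newline, unlike the items returned by splitlines().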

You could also check that the month/year is correct first, and only create datetime objects for lines that have the correct month/year, which would be a lot faster.

If your lines started as above:

Jun/11 

And you wanted Jun/11, then simply test if line.startswith("Jun/11") and only then start creating datetime objects.

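with open(fileToParse, "r") as f:
    for line in f:
        # Cheap string test: skip ahead until the wanted day first appears
        if line.startswith("Jun/11"):
            # Keep consuming the same file iterator; only now build datetimes
            for line in f:
                try:
                    lineDateTime = datetime.datetime...
                except ValueError:
                    continue
                if startDateTime < lineDateTime < endDateTime:
                    writeFile.write(line)  # line already ends with "\n"
                elif lineDateTime > endDateTime:
                    break
            break  # done; do not keep scanning the rest of the file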

Upvotes: 2
