madu

Reputation: 5450

Most efficient way to parse a file in Python

I want to know the most efficient way to parse a text file. For example, let's say I have the following text file:

Number of connections server is: 1

Server status is: ACTIVE

Number of connections to server is: 4

Server status is: ACTIVE

Server is not responding: 13:25:03

Server connection is established: 13:27:05

What I want to do is to go through the file and gather information. For example, number of connections to the server, or the times the server went down. I want to save these values in maybe lists, so that I can view or plot them later.

So what is the best way to perform this, assuming I have my keywords in a list as follows:

referenceLines = ['connections server', 'Server status', 'not responding']

Note that the list does not hold complete sentences, only parts of them. I want to go through the file line by line and check whether the line just read contains any entry in the referenceLines list. If it does, I want to get the index of that list entry and call the corresponding function.

What would be the most efficient way (in time and memory) to do this, given that a typical text file will be about 50 MB in size?

Thank you.


Upvotes: 1

Views: 3128

Answers (4)

Janne Karila

Reputation: 25197

Here's one possible approach. It uses a regular expression pattern of the form 'keyword1|keyword2' to search for multiple keywords at once.

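import re

def func1(line):
    pass  # handle a "connections server" line

def func2(line):
    pass  # handle a "Server status" line

# Map each keyword to the function that handles it.
actions = {'connections server': func1,
           'Server status': func2}

# One alternation pattern, so each line is scanned once for all keywords.
regex = re.compile('|'.join(re.escape(key) for key in actions))

with open('server.log') as file:  # placeholder file name
    for line in file:
        for matchobj in regex.finditer(line):
            actions[matchobj.group()](line)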

Upvotes: 1

marbdq

Reputation: 1235

If the text file you want to parse always contains the same fields in the same order, then mikerobi's solution is good. Otherwise, you need to iterate through the lines and try detecting referenceLines...
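A minimal sketch of that line-by-line detection, assuming the referenceLines list from the question. The handler functions, the `value_of` helper, and the result lists are hypothetical names chosen for illustration, and the sample lines are taken from the question's example file.

```python
reference_lines = ['connections server', 'Server status', 'not responding']

connections = []   # values from "connections server" lines
statuses = []      # values from "Server status" lines
down_times = []    # timestamps from "not responding" lines

def value_of(line):
    # Text after the first ': ' separator, without surrounding whitespace.
    return line.split(': ', 1)[1].strip()

# Hypothetical handlers, in the same order as reference_lines.
handlers = [
    lambda line: connections.append(int(value_of(line))),
    lambda line: statuses.append(value_of(line)),
    lambda line: down_times.append(value_of(line)),
]

def parse(lines):
    for line in lines:
        for index, keyword in enumerate(reference_lines):
            if keyword in line:          # plain substring test
                handlers[index](line)
                break                    # at most one handler per line

sample = [
    'Number of connections server is: 1',
    'Server status is: ACTIVE',
    'Server is not responding: 13:25:03',
]
parse(sample)
```

Passing an open file object to `parse` instead of a list should keep memory use low, because a file object yields its lines one at a time.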

Upvotes: 1

rupello

Reputation: 8491

As a practical approach, I suggest that you implement this in a series of steps while measuring the performance at each step to gauge the cost of the approach you are using with your test data.

For example:

  • How long does it take to simply read the file line by line?
  • How long if you split() each line?
  • How long if you run re.match() on each line?

The optimal solution will depend on your data (for example, how many reference lines you are using), but it should take only a few seconds on a modern machine.
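One way to take those measurements is sketched below. The synthetic in-memory lines stand in for the real file, and the `timed` helper and the three step functions are assumptions made for illustration. Absolute numbers will differ per machine and data set.

```python
import re
import time

# Synthetic stand-in for the real file (assumption); iterate over
# an open file object instead to measure real data.
lines = ['Number of connections to server is: 4\n',
         'Server status is: ACTIVE\n'] * 50000

def timed(label, work):
    # Run work() once and return the elapsed wall-clock seconds.
    start = time.perf_counter()
    work()
    elapsed = time.perf_counter() - start
    print(f'{label}: {elapsed:.3f} s')
    return elapsed

def just_read():
    for line in lines:
        pass

def split_each():
    for line in lines:
        line.split()

pattern = re.compile(r'Server status')

def match_each():
    for line in lines:
        pattern.match(line)

results = {
    'read': timed('read only', just_read),
    'split': timed('split()', split_each),
    'match': timed('re.match()', match_each),
}
```

Comparing each step against the "read only" baseline shows how much of the total time the parsing itself adds.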

Upvotes: 1

mikerobi

Reputation: 20878

If every line contains the separator ": " between the message and its value, you can split the string.

message, value = line.split(': ', 1)
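As a small illustration, the `1` limits the split to the first ": ", so exactly two parts come back even when the value contains the separator again. The first line is from the question's sample file. The second line is a hypothetical example, and `str.partition` is shown as a variant for lines that may lack the separator, such as blank lines.

```python
line = 'Server is not responding: 13:25:03'
message, value = line.split(': ', 1)

# Hypothetical line whose value contains ': ' again; maxsplit=1 keeps it whole.
message_b, value_b = 'Last error is: timeout: retry'.split(': ', 1)

# str.partition always returns three parts and never raises;
# sep is empty when the separator is missing.
message_c, sep, value_c = 'no separator here'.partition(': ')
```

With `split`, a line without ": " yields a single part and the two-name unpacking raises ValueError, so checking `sep` after `partition` is one way to skip such lines.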

Upvotes: 4
