susdu

Reputation: 862

Fastest way to parse large text in Python

I have a 20 MB, 1-million-line file in the following format:

# REG A
TextToParse1
TextToParse2
...
...
...
TextToParseX
# reg A
# REG B
TextToParse1
TextToParse2
...
...
...
TextToParseX
# reg B
(continued)

There are about 20k blocks in this format. I perform lookups in the file using a list of names: REG Z, REG YYY, REG C, REG ASDSX (the order is random). On each iteration I capture the relevant text between # REG X and # reg X, process it, and continue to the next name in the list. I'm looking for the fastest method to achieve this.

I took the regex approach. I timed a single lookup, and my measurements show that:

import re
from timeit import default_timer as timer

start = timer()
# reg_name is the current name; file is the whole file read into a string
pattern = r"(# REG {0})(.*)(# reg {0})".format(reg_name)
match = re.search(pattern, file, re.DOTALL)
end = timer()

takes 0.2 seconds. Multiplied by 20k lookups, that is far too slow.

Upvotes: 3

Views: 3241

Answers (3)

Casimir et Hippolyte

Reputation: 89547

You can do something like this:

with open('file.txt') as fh:
    for line in fh:
        if line.startswith('# REG '):
            reg = line.split()[2]
            blocklist = []
            for line in fh:
                if line.startswith('# reg '):
                    block = ''.join(blocklist)
                    # do what you need here
                    # print(reg)
                    # print(block)
                    break
                blocklist.append(line)

(Feel free to turn this into a generator.)
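
For example, a generator version might look like this (a rough sketch; the helper name read_blocks is my own, not from the answer):

def read_blocks(path):
    # yield (name, block) pairs, one per '# REG'/'# reg' section
    with open(path) as fh:
        for line in fh:
            if line.startswith('# REG '):
                reg = line.split()[2]
                blocklist = []
                for line in fh:
                    if line.startswith('# reg '):
                        yield reg, ''.join(blocklist)
                        break
                    blocklist.append(line)

for reg, block in read_blocks('file.txt'):
    print(reg)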

using itertools:

from itertools import takewhile

with open('file.txt') as fh:
    for line in fh:
        if line.startswith('# REG '):
            reg = line.split()[2]
            # takewhile also consumes the terminating '# reg ' line from fh
            block = ''.join(takewhile(lambda x: not x.startswith('# reg '), fh))
            # do what you want here
            # print(reg)
            # print(block)

using regex:

import re

with open('file.txt') as fh:
    blocks = re.findall(r'(?m)^# REG (.*)\n((?:.*\n)*?)# reg ', fh.read())
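
Since the question involves ~20k lookups against the same file, one option (my addition, not part of the original answer) is to turn the findall result into a dict once, so that every subsequent lookup is constant-time:

import re

with open('file.txt') as fh:
    # one pass over the file: build a name -> block mapping
    # (assumes block names are unique)
    index = dict(re.findall(r'(?m)^# REG (.*)\n((?:.*\n)*?)# reg ', fh.read()))

block = index['A']    # each of the 20k lookups is now a plain dict access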

Upvotes: 2

Wiktor Stribiżew

Reputation: 626689

The pattern with .* implies some backtracking, and the amount of backtracking depends on how long the text is, whether you used the DOTALL modifier, and whether there is a match. You enabled DOTALL mode, so once # REG A is found, the regex engine grabs the whole remaining text with .* and starts backtracking in search of the end delimiter, # reg A. That can be a long way back before the match is found.

What can be done? If your file is properly formatted and the blocks are short (from the start delimiter to the end delimiter), it should be enough to use lazy dot matching:

pattern = r"# REG {0}(.*?)# reg {0}".format(reg_name)

This should still be used with re.DOTALL.
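
For instance, a single lookup with the lazy pattern could look like this (a sketch; reg_name and file_text are placeholders of mine, and re.escape is my addition in case a name ever contains regex metacharacters):

import re

with open('file.txt') as fh:
    file_text = fh.read()

reg_name = 'A'    # one name from the lookup list
pattern = r"# REG {0}(.*?)# reg {0}".format(re.escape(reg_name))
match = re.search(pattern, file_text, re.DOTALL)
if match:
    block = match.group(1)    # the text between the delimiters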

If the blocks are very long, lazy dot matching loses out to unrolled patterns in performance:

pattern = r'# REG {0}([^#]*(?:#(?! reg {0})[^#]*)*)'.format(reg_name)

A breakdown of the pattern:

  • # REG {0} - the start delimiter pattern
  • ([^#]*(?:#(?! reg {0})[^#]*)*) - Group 1
    • [^#]* - zero or more non-#
    • (?:#(?! reg {0})[^#]*)* - zero or more sequences of
      • #(?! reg {0}) - a # char not followed with space+reg+space+name
      • [^#]* - zero or more non-#

This way, the engine reaches the trailing delimiter by consuming, in linear fashion, chunks that cannot be part of it.

If the delimiters always appear at the start of a line, you could use the (?m)^# REG {0}(.*(?:\r?\n(?!# reg {0}).*)*) regex, built on the same technique.
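
For completeness, a sketch of the unrolled pattern in use (reg_name and file_text are again placeholders of mine):

import re

with open('file.txt') as fh:
    file_text = fh.read()

reg_name = 'A'
pattern = r'# REG {0}([^#]*(?:#(?! reg {0})[^#]*)*)'.format(reg_name)
# no re.DOTALL needed here: [^#] already matches newlines
match = re.search(pattern, file_text)
if match:
    block = match.group(1)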

Upvotes: 4

r3mainer

Reputation: 24547

If you're just looking for lines that start with either # REG or # reg, there's no need to use regular expressions at all. This should suffice:

def loadmyfile(filename):
    reg = ""
    nlp = 0    # number of data lines processed
    with open(filename, "r") as fh:
        for line in fh:
            if line[:6] == "# REG ":
                reg = line[6:]
            elif line[:6] == "# reg ":
                reg = ""
            else:
                # (Process the data here)
                nlp += 1
    print("Number of lines processed: %d" % nlp)

Upvotes: 1
