Reputation: 862
I have a 20 MB, 1-million-line file in the following format:
# REG A
TextToParse1
TextToParse2
...
...
...
TextToParseX
# reg A
# REG B
TextToParse1
TextToParse2
...
...
...
TextToParseX
# reg B
(continued)
There are about 20k blocks in this format.
I perform lookups in the file using a list of names: REG Z, REG YYY, REG C, REG ASDSX (the order is random). On each iteration I capture the relevant text between # REG X and # reg X, process it, and continue to the next name in the list. I'm looking for the fastest method to achieve this.
I went with the regex approach. I timed a single lookup, and my measurements show that this:

import re
from timeit import default_timer as timer  # assuming timeit's default timer

start = timer()
pattern = r"(# REG {0})(.*)(# reg {0})".format(reg_name)
match = re.search(pattern, file, re.DOTALL)  # 'file' holds the full file contents as a string
end = timer()

takes about 0.2 seconds. Times 20k lookups, that is roughly an hour, which is far too slow.
Upvotes: 3
Views: 3241
Reputation: 89547
You can do something like this:
with open('file.txt') as fh:
    for line in fh:
        if line.startswith('# REG '):
            reg = line.split()[2]
            blocklist = []
            for line in fh:
                if line.startswith('# reg '):
                    block = ''.join(blocklist)
                    # do what you need here
                    # print(reg)
                    # print(block)
                    break
                blocklist.append(line)
(Feel free to make a generator out of it.)
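For instance, a generator version might look like this (a sketch; iter_blocks is an illustrative name, not part of the original answer):

def iter_blocks(filename):
    # Yield a (name, block) pair for each '# REG name' ... '# reg name' section.
    with open(filename) as fh:
        for line in fh:
            if line.startswith('# REG '):
                reg = line.split()[2]
                blocklist = []
                for line in fh:
                    if line.startswith('# reg '):
                        break
                    blocklist.append(line)
                yield reg, ''.join(blocklist)

Building a dict from it once, e.g. block_map = dict(iter_blocks('file.txt')), turns the 20k per-name scans into single O(1) lookups.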
Using itertools:

from itertools import takewhile

with open('file.txt') as fh:
    for line in fh:
        if line.startswith('# REG '):
            reg = line.split()[2]
            block = ''.join(takewhile(lambda x: not x.startswith('# reg '), fh))
            # do what you want here
            # print(reg)
            # print(block)
Using a regex:

import re

with open('file.txt') as fh:
    blocks = re.findall(r'(?m)^# REG (.*)\n((?:.*\n)*?)# reg ', fh.read())
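Since findall returns (name, block) tuples here, one extra line gives direct lookups for the whole list of names (block_map is an illustrative name):

block_map = dict(blocks)
print(block_map['A'])  # the text between '# REG A' and '# reg A'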
Upvotes: 2
Reputation: 626689
The pattern with .* implies some backtracking, and the amount of backtracking depends on how long the text is, whether you used the DOTALL modifier, and whether there is a match. You enabled DOTALL mode, so once # REG A is found, the regex engine grabs the whole remaining text with .* and starts backtracking in search of the end delimiter, # reg A. It might be a long way back before that text is found.
What can be done? If your file is properly formatted, and your blocks are short (from the start delimiter to the end delimiter), it should be enough to use lazy dot matching:

pattern = r"# REG {0}(.*?)# reg {0}".format(reg_name)

This should still be used with re.DOTALL.
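For illustration, a minimal runnable check of the lazy version (the sample text is made up, and re.escape is an extra guard in case a name contains regex metacharacters):

import re

text = "# REG A\nTextToParse1\nTextToParse2\n# reg A\n"
reg_name = "A"

pattern = r"# REG {0}(.*?)# reg {0}".format(re.escape(reg_name))
match = re.search(pattern, text, re.DOTALL)
if match:
    print(repr(match.group(1)))  # '\nTextToParse1\nTextToParse2\n'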
If the blocks are very long, lazy dot matching loses in performance to an unrolled pattern:

pattern = r'# REG {0}([^#]*(?:#(?! reg {0})[^#]*)*)'.format(reg_name)
- # REG {0} - the start delimiter pattern
- ([^#]*(?:#(?! reg {0})[^#]*)*) - Group 1:
  - [^#]* - zero or more non-# chars
  - (?:#(?! reg {0})[^#]*)* - zero or more sequences of:
    - #(?! reg {0}) - a # char not followed with space + reg + space + name
    - [^#]* - zero or more non-# chars
This way, we get to the trailing delimiter by consuming the chunks that do not match it, in a linear way.
If the delimiters are always at the start of a line, you could use the (?m)^# REG {0}(.*(?:\r?\n(?!# reg {0}).*)*) regex, built with the same technique.
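A quick check of the line-anchored variant, again with made-up sample text:

import re

text = "# REG A\nTextToParse1\nTextToParse2\n# reg A\n"
reg_name = "A"

pattern = r"(?m)^# REG {0}(.*(?:\r?\n(?!# reg {0}).*)*)".format(re.escape(reg_name))
match = re.search(pattern, text)
if match:
    print(repr(match.group(1)))  # '\nTextToParse1\nTextToParse2'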
Upvotes: 4
Reputation: 24547
If you're just looking for lines that start with either # REG or # reg, there's no need to use regular expressions at all. This should suffice:
def loadmyfile(filename):
    reg = ""
    nlp = 0
    for line in open(filename, "r"):
        if line[:6] == "# REG ":
            reg = line[6:].rstrip("\n")  # the block name
        elif line[:6] == "# reg ":
            reg = ""
        else:
            # (Process the data here)
            nlp += 1
    print("Number of lines processed: %d" % nlp)
Upvotes: 1