Reputation: 59933
I have lots of log files and want to search them for some patterns across multiple lines, but in order to locate each matched string easily, I still want to see the line number of the matched area.
Any good suggestions? (The code sample below is copied.)
string="""
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""
import re
pattern = '.*?####(.*?)####'
matches = re.compile(pattern, re.MULTILINE|re.DOTALL).findall(string)
for item in matches:
    print "lineno: ?", "matched: ", item
[UPDATE] The lineno should be the actual line number in the input.
So the output I want looks like:
lineno: 1, 1
ttteest
lineno: 6, 2
ttttteeeestt
Upvotes: 24
Views: 25040
Reputation: 47978
Out of the box, the re module doesn't provide a way to do this (at least not trivially).
This can be done fairly efficiently by pre-computing an {offset: line_number} mapping up until the last match. This avoids counting back to the beginning of the file for every match.
The following function is similar to re.finditer, but yields (match, line_number) pairs:
def finditer_with_line_numbers(pattern, string, flags=0):
    """
    A version of ``re.finditer`` that returns ``(match, line_number)`` pairs.
    """
    import re

    matches = list(re.finditer(pattern, string, flags))
    if matches:
        end = matches[-1].start()
        # -1 so a failed `rfind` maps to the first line.
        newline_table = {-1: 0}
        for i, m in enumerate(re.finditer("\\n", string), 1):
            # Don't find newlines past our last match.
            offset = m.start()
            if offset > end:
                break
            newline_table[offset] = i

        # Failing to find the newline is OK, -1 maps to 0.
        for m in matches:
            newline_offset = string.rfind("\n", 0, m.start())
            line_number = newline_table[newline_offset]
            yield (m, line_number)
If you want the line contents as well, you can replace the last loop with:
for m in matches:
    newline_offset = string.rfind("\n", 0, m.start())
    newline_end = string.find('\n', m.end())  # '-1' gracefully uses the end.
    line = string[newline_offset + 1:newline_end]
    line_number = newline_table[newline_offset]
    yield (m, line_number, line)
Note that it would be nice to avoid having to create a list from finditer; however, that would mean we wouldn't know when to stop storing newlines (we could end up storing many newlines even if the only pattern match is at the beginning of the file).
If it were important to avoid storing all matches, it's possible to make an iterator that scans newlines as needed, though I'm not sure this would give you much advantage in practice.
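For reference, a minimal usage sketch against the question's sample string (the variable names and output format here are just for illustration; note that the helper numbers the first line as 0, since -1 maps to 0 in the table):
import re

sample = """
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""

for match, lineno in finditer_with_line_numbers(r'.*?####(.*?)####', sample,
                                                re.MULTILINE | re.DOTALL):
    print("lineno: %d, matched: %r" % (lineno, match.group(1)))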
Upvotes: 12
Reputation: 6187
I believe this does more or less what you want:
import re
string="""
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""
pattern = '.*?####(.*?)####'
matches = re.compile(pattern, re.MULTILINE|re.DOTALL)
for match in matches.finditer(string):
    start, end = string[0:match.start()].count("\n"), string[0:match.end()].count("\n")
    print("lineno: %d-%d matched: %s" % (start, end, match.group()))
It might be a little slower than other options because it repeatedly counts newlines over substrings of the input, but since the string in your example is small, I think it's a worthwhile tradeoff for simplicity.
What we also gain here is the range of lines that the pattern matches, which lets us extract the whole matching block in one pass as well. For what it's worth, we could optimize this further by counting the newlines within the match itself instead of scanning the prefix again for the end.
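A rough sketch of that optimization, reusing the matches pattern and string from above, would count newlines only once up to the match start and then inside the match itself:
for match in matches.finditer(string):
    # Count newlines once up to the match start, then only inside the
    # match itself, instead of scanning the whole prefix a second time.
    start = string.count("\n", 0, match.start())
    end = start + match.group().count("\n")
    print("lineno: %d-%d matched: %s" % (start, end, match.group()))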
Upvotes: 1
Reputation: 1416
The finditer function can tell you the character range that matched. From this, you can use a simple newline regular expression to count how many newlines appear before the match. Add one to that count to get the line number, since the usual convention when manipulating text in an editor is to call the first line 1 rather than 0.
import re

def multiline_re_with_linenumber():
    string = """
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""
    re_pattern = re.compile(r'.*?####(.*?)####', re.DOTALL)
    re_newline = re.compile(r'\n')
    count = 0
    for m in re_pattern.finditer(string):
        count += 1
        start_line = len(re_newline.findall(string, 0, m.start(1))) + 1
        end_line = len(re_newline.findall(string, 0, m.end(1))) + 1
        print('"""{}"""\nstart={}, end={}, instance={}'.format(m.group(1), start_line, end_line, count))
This gives the following output:
"""1
ttteest
"""
start=2, end=4, instance=1
"""2
ttest
"""
start=7, end=10, instance=2
Upvotes: 2
Reputation: 190
You can store the line-ending offsets beforehand and then look them up for each match afterwards.
import re
string="""
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""
end = '.*\n'
line = []
for m in re.finditer(end, string):
    line.append(m.end())

pattern = '.*?####(.*?)####'
match = re.compile(pattern, re.MULTILINE|re.DOTALL)
for m in re.finditer(match, string):
    print 'lineno :%d, %s' % (next(i for i in range(len(line)) if line[i] > m.start(1)), m.group(1))
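As a side note, if the log is large, the linear scan inside next(...) can be replaced by a binary search over the stored offsets. A sketch using the standard bisect module, reusing the line list and compiled match pattern from above:
import bisect

for m in re.finditer(match, string):
    # bisect_right returns the first index whose stored line-end offset is
    # greater than the match start, i.e. the same index the generator computes.
    lineno = bisect.bisect_right(line, m.start(1))
    print('lineno :%d, %s' % (lineno, m.group(1)))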
Upvotes: 7
Reputation: 27575
import re
text = """
####1
ttteest
####1
ttttteeeestt
####2
ttest
####2
"""
pat = ('^####(\d+)'
       '(?:[^\S\n]*\n)*'
       '\s*(.+?)\s*\n'
       '^####\\1(?=\D)')
regx = re.compile(pat, re.MULTILINE)
print '\n'.join("lineno: %s matched: %s" % t
                for t in regx.findall(text))
Result:
lineno: 1 matched: ttteest
lineno: 2 matched: ttest
Upvotes: -2
Reputation: 2553
What you want is a typical task that regex is not very good at: parsing.
You could read the log file line by line and search each line for the strings you are using to delimit your search. You could use regex line by line, but it is less efficient than plain string matching unless you are looking for complicated patterns.
And if you are looking for complicated matches, I'd like to see them. Searching every line in a file for #### while maintaining the line count is easier without regex.
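For example, a plain line-by-line scan could look like this rough sketch ('some.log' is just a placeholder file name):
with open('some.log') as f:  # placeholder path
    for lineno, line in enumerate(f, 1):
        # Plain substring test per line, keeping the running line count.
        if '####' in line:
            print('lineno: %d, matched: %s' % (lineno, line.rstrip('\n')))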
Upvotes: 8