Larry Cai

Reputation: 59933

Python regex: match across multiple lines, but still get the line number

I have lots of log files and want to search them for patterns that span multiple lines. To locate each matched string easily, I still want to see the line number of the matched area.

Any good suggestions? (The code sample is copied.)

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

import re
pattern = '.*?####(.*?)####'
matches= re.compile(pattern, re.MULTILINE|re.DOTALL).findall(string)
for item in matches:
    print "lineno: ?", "matched: ", item

[UPDATE] The lineno is the actual line number.

So the output I want looks like:

    lineno: 1, 1
    ttteest
    lineno: 6, 2
    ttttteeeestt

Upvotes: 24

Views: 25040

Answers (6)

ideasman42

Reputation: 47978

Out of the box the re module doesn't provide a way to do this (at least not trivially).

This can be done fairly efficiently by:

  • Finding all matches
  • Looping over newlines, storing an {offset: line_number} mapping up to the last match.
  • For each match:
    • Reverse find the offset of the first newline beforehand.
    • Look up its line number in the map.

This avoids counting back to the beginning of the file for every match.

The following function is similar to re.finditer, but yields (match, line_number) pairs:

def finditer_with_line_numbers(pattern, string, flags=0):
    """
    A version of ``re.finditer`` that returns ``(match, line_number)`` pairs.
    """
    import re

    matches = list(re.finditer(pattern, string, flags))
    if matches:
        end = matches[-1].start()
        # -1 so a failed `rfind` maps to the first line.
        newline_table = {-1: 0}
        for i, m in enumerate(re.finditer("\\n", string), 1):
            # Don't find newlines past our last match.
            offset = m.start()
            if offset > end:
                break
            newline_table[offset] = i

        # Failing to find the newline is OK, -1 maps to 0.
        for m in matches:
            newline_offset = string.rfind("\n", 0, m.start())
            line_number = newline_table[newline_offset]
            yield (m, line_number)

If you want the contents, you can replace the last loop with:

        for m in matches:
            newline_offset = string.rfind("\n", 0, m.start())
            newline_end = string.find("\n", m.end())
            if newline_end == -1:
                # No trailing newline: slice to the end of the string
                # (a bare -1 slice bound would drop the last character).
                newline_end = len(string)
            line = string[newline_offset + 1:newline_end]
            line_number = newline_table[newline_offset]
            yield (m, line_number, line)

Note that it would be nice to avoid creating a list from finditer; however, we then wouldn't know when to stop storing newlines (it could end up storing many newlines even if the only pattern match is at the beginning of the file).

If it were important to avoid storing all matches, it would be possible to make an iterator that scans newlines as needed, though I'm not sure this would give you much advantage in practice.
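For what it's worth, here is a minimal sketch of that as-needed variant. It assumes only that re.finditer yields non-overlapping matches from left to right, so the newline count can be carried forward from one match to the next. The name iter_matches_with_lines is made up for illustration, and line numbers are 0-based like in the function above.

```python
import re

def iter_matches_with_lines(pattern, string, flags=0):
    """Lazily yield (match, line_number) pairs; line_number is 0-based."""
    line_number = 0  # newlines counted so far
    scanned = 0      # offset up to which newlines have been counted
    for m in re.finditer(pattern, string, flags):
        # Matches arrive in increasing start order, so only the text
        # between the previous match start and this one is scanned.
        line_number += string.count("\n", scanned, m.start())
        scanned = m.start()
        yield m, line_number
```

This keeps neither the list of matches nor a newline table in memory, and it never scans past the last match.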

Upvotes: 12

anarcat

Reputation: 6187

I believe this does more or less what you want:

import re

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

pattern = '.*?####(.*?)####'
matches = re.compile(pattern, re.MULTILINE|re.DOTALL)
for match in matches.finditer(string):
    start, end = string[0:match.start()].count("\n"), string[0:match.end()].count("\n")
    print("lineno: %d-%d matched: %s" % (start, end, match.group()))

It might be a little slower than other options because it repeatedly slices the string and counts newlines from the start, but since the string in your example is small, I think it's worth the tradeoff for simplicity.

What we gain here is also the range of lines that match the pattern, which lets us extract the whole string in one go. We might optimize this further by counting the newlines inside the match instead of recounting from the start of the string up to the end of the match, for what it's worth.
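A rough sketch of that optimization, under the assumption that the end line is simply the start line plus the number of newlines inside the matched text (the helper name match_line_ranges is invented here):

```python
import re

def match_line_ranges(pattern, string, flags=0):
    """Return (start_line, end_line, text) per match, as 0-based newline counts."""
    results = []
    for m in re.finditer(pattern, string, flags):
        # Newlines before the match give the starting line.
        start = string.count("\n", 0, m.start())
        # Count newlines only inside the match instead of rescanning from offset 0.
        end = start + m.group().count("\n")
        results.append((start, end, m.group()))
    return results
```

The start line is still counted from the beginning of the string each time, so this only saves the second scan.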

Upvotes: 1

ChipJust

Reputation: 1416

The finditer function can tell you the character range that matched. From this you can use a simple newline regular expression to count how many newlines were before the match. Add one to the number of newlines to get the line number, as our convention in manipulating text in an editor is to call the first line 1 rather than 0.

import re

def multiline_re_with_linenumber():
    string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""
    re_pattern = re.compile(r'.*?####(.*?)####', re.DOTALL)
    re_newline = re.compile(r'\n')
    count = 0
    for m in re_pattern.finditer(string):
        count += 1
        start_line = len(re_newline.findall(string, 0, m.start(1)))+1
        end_line = len(re_newline.findall(string, 0, m.end(1)))+1
        print ('"""{}"""\nstart={}, end={}, instance={}'.format(m.group(1), start_line, end_line, count))

Calling it gives this output:

"""1
ttteest
"""
start=2, end=4, instance=1
"""2

ttest
"""
start=7, end=10, instance=2

Upvotes: 2

himanshu shekhar

Reputation: 190

You can store the end offset of every line beforehand and then look up the line number for each match afterwards.

import re

string="""
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2
"""

end='.*\n'
line=[]
for m in re.finditer(end, string):
    line.append(m.end())

pattern = '.*?####(.*?)####'
match=re.compile(pattern, re.MULTILINE|re.DOTALL)
for m in re.finditer(match, string):
    # The first line whose end offset lies past the match start is the line we want.
    print('lineno: %d, %s' % (next(i for i in range(len(line)) if line[i] > m.start(1)), m.group(1)))
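If there are many lines and many matches, the linear scan in that lookup can be swapped for a binary search. This is only a sketch using the standard bisect module. The helper names line_starts and line_of are made up, and the numbering is 0-based, the same as in the loop above.

```python
import bisect
import re

def line_starts(string):
    """Offsets at which each line begins (line 0 starts at offset 0)."""
    return [0] + [m.end() for m in re.finditer(r"\n", string)]

def line_of(offset, starts):
    """0-based number of the line containing the given character offset."""
    return bisect.bisect_right(starts, offset) - 1

text = "\n####1\nttteest\n####1\nttttteeeestt\n\n####2\n\nttest\n####2\n"
starts = line_starts(text)
linenos = [line_of(m.start(1), starts)
           for m in re.finditer(r".*?####(.*?)####", text, re.DOTALL)]
# linenos is [1, 6] for this sample
```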

Upvotes: 7

eyquem

Reputation: 27575

import re

text = """
####1
ttteest
####1
ttttteeeestt

####2   

ttest
####2
"""

# Raw strings keep the regex escapes (\d, \S, \1) intact.
pat = (r'^####(\d+)'
       r'(?:[^\S\n]*\n)*'
       r'\s*(.+?)\s*\n'
       r'^####\1(?=\D)')
regx = re.compile(pat, re.MULTILINE)

print('\n'.join("lineno: %s  matched: %s" % t
                for t in regx.findall(text)))

Result:

lineno: 1  matched: ttteest
lineno: 2  matched: ttest

Upvotes: -2

melwil

Reputation: 2553

What you want is a typical task that regex is not very good at: parsing.

You could read the log file line by line and search each line for the strings you are using to delimit your search. You could use a regex line by line, but it is less efficient than regular string matching unless you are looking for complicated patterns.

And if you are looking for complicated matches, I'd like to see them. Searching every line in a file for #### while maintaining the line count is easier without regex.
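As a rough sketch of the line-by-line approach, assuming each block opens and closes with a line that starts with #### followed by the same tag (as in the sample from the question). The function name find_delimited_blocks is invented, and lines are numbered from 1.

```python
def find_delimited_blocks(text, marker="####"):
    """Return (start_line, tag, body_lines) for each block between matching marker lines."""
    blocks = []
    open_tag = None   # tag of the block currently being read, if any
    start_line = 0
    body = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if line.startswith(marker):
            tag = line[len(marker):].strip()
            if open_tag is None:
                # Opening delimiter: remember where the block starts.
                open_tag, start_line, body = tag, lineno, []
            elif tag == open_tag:
                # Closing delimiter with the same tag: emit the block.
                blocks.append((start_line, open_tag, body))
                open_tag = None
            else:
                # A different tag inside an open block is kept as body text.
                body.append(line)
        elif open_tag is not None:
            body.append(line)
    return blocks
```

For a real log file you would loop over the open file object instead of text.splitlines(), so the whole file never has to sit in memory.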

Upvotes: 8
