John Schmitt

Reputation: 1218

Multi-line regex fails to match even though individual items to

I'm trying to search through a bunch of large text files for specific information.

#!/usr/bin/env python
# python 3.4
import re
sometext = """
    lots
    of
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    lots
    more
    text here
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even
    more
    text here
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------
    """
sometextpattern = re.compile( '''.*Sentinel\s+starts.*$                           # sentinel
                                 ^.*-+.*$                                         # dividing line
                                 ^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+  # item details
                                 ^.*-+.*$                                         # dividing line                                  
                              ''', flags = re.MULTILINE | re.VERBOSE)
print( re.findall( sometextpattern, sometext ) )

Each piece (the sentinel, the dividing line) matches on its own. How do I make them work together? That is, I would like this to print:

[('item_one','item_one_result'),('item_two','item_two_result'),('item_three','item_three_result'),('item_four','item_four_result'),('item_five','item_five_result'),('item_six','item_six_result')]

Upvotes: 0

Views: 367

Answers (3)

chapelo

Reputation: 2562

Try these regexes:

for m in re.findall(r'(?:Sentinel starts\n[-\n]*)([^-]+)', sometext, flags=re.M ):
    print(list(re.findall(r'(\w+)\s+(\w+)', m)))

It gives you one list of (key, value) tuples per block:

# [('item_one', 'item_one_result'), ('item_two', 'item_two_result')]
# [('item_three', 'item_three_result'), ('item_four', 'item_four_result')]

Because the lines in your text carry extra whitespace (such as the indentation), replace the regex in the for statement with this one:

r'(?:Sentinel starts\s+-*)([^-]*\b)'
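Here is a minimal sketch of the two-step approach using the whitespace-tolerant pattern. It assumes a shortened stand-in for the question's text, and the names block_pattern, pair_pattern, and items are only illustrative:

```python
import re

# Shortened stand-in for the question's text (lines are indented, as in the original).
sometext = """
    lots of text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    more text here
    Sentinel starts
    --------------------
    item_three             item_three_result
    --------------------
    """

# Step 1: capture everything between the dashed lines that follow a sentinel.
block_pattern = re.compile(r'(?:Sentinel starts\s+-*)([^-]*\b)')
# Step 2: pull (name, value) pairs out of each captured block.
pair_pattern = re.compile(r'(\w+)\s+(\w+)')

items = []
for block in block_pattern.findall(sometext):
    items.extend(pair_pattern.findall(block))

print(items)
```

Flattening with extend() yields the single list the question asks for. Note that [^-] stops at the first hyphen, so this sketch would likely need adjusting if item names or values could contain hyphens.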

Upvotes: 1

Avinash Raj

Reputation: 174696

Use two capturing groups in order to get the text you want inside the list. Note that \G is not supported by the built-in re module, so this uses the third-party regex module.

>>> import regex
>>> text = """    lots
    of
    text here
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    lots
    more
    text here
    Sentinel starts
    --------------------
    item_three               item_three_result
    item_four                item_four_result
    item_five                item_five_result
    --------------------
    even
    more
    text here
    Sentinel starts
    --------------------
    item_six                item_six_result
    --------------------"""
>>> regex.findall(r'(?:(?:\bSentinel starts\s*\n\s*-+\n\s*|-+)|(?<!^)\G) *(\w+) *(\w+)\n*', text)
[('item_one', 'item_one_result'), ('item_two', 'item_two_result'), ('item_three', 'item_three_result'), ('item_four', 'item_four_result'), ('item_five', 'item_five_result'), ('item_six', 'item_six_result')]

\s* matches zero or more whitespace characters and \w+ matches one or more word characters. \G asserts the position at the end of the previous match, or at the start of the string for the first match.
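If installing the regex module is not an option, the "continue where the previous match ended" behaviour of \G can be roughly approximated with the standard library, because a compiled pattern's match() accepts a start position and anchors there. This is only a sketch on a shortened sample, not an exact equivalent of the pattern above; the names header, item, and results are illustrative:

```python
import re

# Shortened sample; the line after the closing dashes looks like an item
# ("trailing text" is two words) but must not be picked up.
text = """\
    Sentinel starts
    --------------------
    item_one               item_one_result
    item_two               item_two_result
    --------------------
    trailing text
"""

header = re.compile(r'Sentinel starts\s*\n\s*-+\n')
item = re.compile(r'[ ]*(\w+)[ ]+(\w+)[ ]*\n')

results = []
for head in header.finditer(text):
    pos = head.end()
    while True:
        # match(text, pos) only succeeds exactly at pos, similar to \G.
        m = item.match(text, pos)
        if m is None:
            break  # first non-item line (the closing dashes) ends the block
        results.append(m.groups())
        pos = m.end()

print(results)
```

Because each item match must start exactly where the previous one ended, lines outside a sentinel block are never considered, even when they happen to look like items.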

Upvotes: 1

Strikeskids

Reputation: 4052

The re.MULTILINE flag only makes ^ and $ match at the beginning and end of each line, respectively; neither one consumes the newline itself. If you want the pattern to span multiple lines, you need to add a whitespace metacharacter (\s) after each $ to match the newline:

.*Sentinel\s+starts.*$\s
^.*-+.*$\s
^.*\s+(?P<itemname>\w+)\s+(?P<itemvalue>\w+)\s+
^.*-+.*$
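The effect is easy to see on a two-line string. The following is a small illustrative sketch; the sample text and variable names are made up for the demonstration:

```python
import re

text = "alpha\nbeta\n"

# With re.MULTILINE, ^ and $ match at every line boundary...
print(re.findall(r'^\w+$', text, flags=re.MULTILINE))

# ...but they consume nothing, so "$^" can never step across the newline.
no_newline = re.search(r'alpha$^beta', text, flags=re.MULTILINE)

# Consuming the newline explicitly (here with \s) lets the match continue.
with_newline = re.search(r'alpha$\s^beta', text, flags=re.MULTILINE)

print(no_newline)
print(repr(with_newline.group(0)))
```

The first search fails because $ stops just before the newline while ^ only matches just after it, and nothing in between consumes that character.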

Also, the string you are using does not have the required backslash escaping. I would recommend using a raw string (r'...') instead, so that you do not have to escape your backslashes.
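To illustrate why raw strings matter, here is a small sketch using \b, which means "backspace" in a normal string literal but "word boundary" to the regex engine. The variable names are just for the example:

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
plain = '\bitem\b'
# In a raw string literal, the backslash survives and the regex engine sees \b.
raw = r'\bitem\b'

print(len(plain), len(raw))          # the raw version keeps both backslashes
print(re.findall(plain, 'item one')) # looks for literal backspace characters
print(re.findall(raw, 'item one'))   # looks for "item" at word boundaries
```

Sequences such as \s and \w are not string escapes, so they happen to reach the regex engine intact even without the r prefix, but newer Python versions warn about such unrecognized escapes, so a raw string is the safer habit.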

Upvotes: 1
