Rajat Mitra

Reputation: 169

How to find patterns in text spanning multiple lines?

I want to look for indexed array elements that are grouped in comma-separated collections, and the search should result in something like this (see the file data example below) -

[    'foo[0]',     'foo[1]',     'foo[2]', ...,     'foo[10]']
['foobar0[0]', 'foobar0[1]', 'foobar0[2]'  ..., 'foobar0[98]']
[    'bas[0]',     'bas[1]',     'bas[2]'  ...,     'bas[99]']

I have a text file where these appear as "collections" that span multiple lines, and each collection is delimited by {..} (as shown below) -

{foo[0], foo[1], foo[2], foo[3]...\n
foo[10]}, {fooba0[0], foobar0[1], foobar0[2],....\n
foobar0[98], foobar0[99]}, {bas[0], bas[1], bas[2]...\n
bas[99]}

The general expression I am using to search the array elements is -

re.findall('[a-z][A-Z]+[0-9]+\[[0-9]+\]', <list item>)

In yacc this would translate to something like -

array_element_token:     [a-z][A-Z]+[0-9]+\[[0-9]+\]
array_items_continued:   array_items_continued             |
                         array_element_token ',' 
arrays_items:            '{' array_items_continued array_element_token '},'

But I am not sure how to create the recursive rule using Python regular expressions.

Upvotes: 1

Views: 147

Answers (1)

Wiktor Stribiżew

Reputation: 627044

You may use

import re

s = r"""{foo[0], foo[1], foo[2], foo[3]...\n
foo[10]}, {fooba0[0], foobar0[1], foobar0[2],....\n
foobar0[98], foobar0[99]}, {bas[0], bas[1], bas[2]...\n
bas[99]}"""
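results = []
# Step 1: grab every {...} group; [^{}]* also matches across newlines
matches = re.findall(r'{[^{}]*}', s)
for m in matches:
    # Step 2: collect the indexed elements inside each group
    results.append(re.findall(r'\w+\[\d+]', m))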

The results are [['foo[0]', 'foo[1]', 'foo[2]', 'foo[3]', 'foo[10]'], ['fooba0[0]', 'foobar0[1]', 'foobar0[2]', 'foobar0[98]', 'foobar0[99]'], ['bas[0]', 'bas[1]', 'bas[2]', 'bas[99]']].

The {[^{}]*} regex extracts all substrings between { and }, and then \w+\[\d+] extracts all substrings that match the following sequences:

  • \w+ - 1+ letters, digits, _ chars
  • \[ - a [ char
  • \d+ - 1+ digits
  • ] - a ] char.
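
The two steps can also be wrapped in a small helper. This is only a sketch: the names extract_groups, GROUP_RE and ELEMENT_RE and the shortened sample string are made up for illustration and are not part of the code above. It assumes the groups are never nested, since [^{}]* stops at the first brace it meets.

```python
import re

# Both names below are illustrative, not from the original answer.
GROUP_RE = re.compile(r'{[^{}]*}')     # one {...} group; a negated class also matches newlines
ELEMENT_RE = re.compile(r'\w+\[\d+]')  # an indexed element such as foo[10]

def extract_groups(text):
    """Return one list of indexed elements per {...} group found in text."""
    return [ELEMENT_RE.findall(group) for group in GROUP_RE.findall(text)]

# Shortened sample with real newlines inside the groups (assumed input shape).
sample = "{foo[0], foo[1],\nfoo[10]}, {bas[0],\nbas[99]}"
print(extract_groups(sample))
# [['foo[0]', 'foo[1]', 'foo[10]'], ['bas[0]', 'bas[99]']]
```

No re.DOTALL flag is needed here, because the pattern has no . in it; the negated character class [^{}] already matches line breaks.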

Upvotes: 1
