TryToBeNice

Reputation: 323

Find a repeated pattern in text with Python's re module

Could anybody help me with the following example? (If I use re.DOTALL, the match runs until the end of the file.)

import re

text = "Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s"

names = ['A', 'K']
for name in names:
    print name
    print re.findall("Found to {0} from:\n\t\-(.+)".format(name), text)

TEXT is like: [image]

OUTPUT:

A

['B', 'D']

K

['B']

Desired OUTPUT:

A

['B', 'C', 'D']

K

['B', 'D', 'E']

Upvotes: 2

Views: 525

Answers (3)

Quinn

Reputation: 4504

Here is another approach (Python 2.7.x):

import re
text = 'Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s'
for name in ['A', 'K']:
    print name
    print [ n for i in re.findall('(?:Found to ' + name + ' from:)(?:\\n\\t-([A-Z]))(?:\\n\\t-([A-Z]))?(?:\\n\\t-([A-Z]))?', text) for n in i if n ]

Output:

A
['B', 'C', 'D']
K
['B', 'D', 'E']

UPDATE: In case you don't know how many (?:\n\t-([A-Z])) groups there will be, I suggest the following approach:

import re
text = 'Found to A from:\n\t-B\n\t-C\n\t-G\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\nMax time: 20s'
for name in ['A', 'K']:
    print name
    groups = re.findall('(?:Found to ' + name + ' from:)((?:\\n\\s*-(?:[A-Z]))+)', text)
    print reduce(lambda i,j: i + j, map(lambda x: re.findall('\n\s*-([A-Z])', x), groups))

Output:

A
['B', 'C', 'G', 'D']
K
['B', 'D', 'E']
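
For readers on Python 3, the two-step idea above (capture each whole run of items after a header, then pull the letters out of that run) could be sketched roughly as follows. The helper name items_for and the use of re.escape are assumptions added for illustration; they are not part of the original answer.

```python
import re

text = ('Found to A from:\n\t-B\n\t-C\n\t-G\nFound to K from:\n\t-B\n\t-D\n\t-E\n'
        'Found to A from:\n\t-D\nMax time: 20s')

def items_for(name, text):
    """Collect every '-X' item listed under any 'Found to <name> from:' header."""
    # Step 1: capture the whole run of "\n<whitespace>-X" lines after each header.
    block_re = re.compile(
        r'Found to ' + re.escape(name) + r' from:((?:\n\s*-[A-Z])+)')
    # Step 2: extract the single letters from each captured run.
    item_re = re.compile(r'\n\s*-([A-Z])')
    items = []
    for block in block_re.findall(text):
        items.extend(item_re.findall(block))
    return items

for name in ['A', 'K']:
    print(name)
    print(items_for(name, text))
```

Using a plain loop with extend instead of reduce also avoids an error when a name has no matching header at all; in that case the sketch simply returns an empty list.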

Upvotes: 4

Kordi

Reputation: 2465

Not generic, but it works in your case, is simple, and uses findall like you mentioned.

import re

text = "Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\nFound to A from:\n\t-D\n"

names = ['A', 'K']
for name in names:
    print name
    test = re.findall("Found to {0} from:\n\t-([A-Z])(\n\t)?-?([A-Z])?(\n\t)?-?([A-Z])?".format(name), text)
    # normalize it
    prettyList = []
    for (a,b,c,d,e) in test:
        prettyList.append(a)
        prettyList.append(c)
        prettyList.append(e)
    print [x for x in prettyList if x]

The output

A
['B', 'C', 'D']
K
['B', 'D', 'E']

I know there may be cases with more elements per block; for those you have to add additional optional groups to the pattern.
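
If an upper bound on the number of items per block is known, the extra optional groups do not have to be typed by hand; the pattern can be generated. A rough Python 3 sketch, where build_pattern, items_for and max_items are hypothetical names introduced here for illustration:

```python
import re

text = ("Found to A from:\n\t-B\n\t-C\nFound to K from:\n\t-B\n\t-D\n\t-E\n"
        "Found to A from:\n\t-D\n")

def build_pattern(name, max_items):
    """Header, one required item, then up to max_items - 1 optional items."""
    item = r'\n\t-([A-Z])'
    optional_item = r'(?:\n\t-([A-Z]))?'
    return ('Found to ' + re.escape(name) + ' from:'
            + item + optional_item * (max_items - 1))

def items_for(name, text, max_items=3):
    matches = re.findall(build_pattern(name, max_items), text)
    if max_items == 1:
        # With a single group, findall returns plain strings.
        return matches
    # With several groups, findall returns tuples; unmatched groups are ''.
    return [letter for groups in matches for letter in groups if letter]

for name in ['A', 'K']:
    print(name)
    print(items_for(name, text))
```

The limitation stays the same as in the hand-written pattern: any items beyond max_items in a block are silently skipped, so this only helps when the bound is really known.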

Upvotes: 0

timgeb

Reputation: 78650

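I wrote this answer for your original question, where you had a file with specific content to parse. I think it still applies. If you have a string instead, leave keys_and_values unchanged and pass it an iterator over the string's lines instead of the file object:

keys_and_values(iter(text.splitlines()))

Do not simply change both occurrences of

for line in f:

to

for line in f.splitlines():

because splitlines() builds a new list on each call, so the second loop would start again from the first line instead of continuing after the headers.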
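
For the string case, here is a minimal Python 3 sketch of feeding the generator from a string. It mirrors the two-loop structure of the generator in this answer; the sample text, the early return when no header is found, and stripping the leading dash from each value are assumptions added for illustration.

```python
import re

def keys_and_values(lines):
    target = r'^\s*Found to [A-Z] from:\s*$'
    # First loop: skip header lines until the first "Found to X from:" line.
    for line in lines:
        if re.match(target, line.strip()):
            break
    else:
        return  # no "Found to" line at all (assumed behaviour)
    key = line.strip()[9]
    # Second loop: continues after the first match only if `lines` is a
    # single iterator; a list would be walked again from its first element.
    for line in lines:
        line = line.strip()
        if re.match(target, line):
            key = line[9]
        elif line:
            yield (key, line)

text = ("some header\nFound to A from:\n\t-B\n\t-C\n"
        "Found to K from:\n\t-B\nFound to A from:\n\t-D")

result = {}
for k, v in keys_and_values(iter(text.splitlines())):
    # Assumption: items are written as "-B", so drop the leading dash.
    result.setdefault(k, []).append(v.lstrip('-'))
print(result)
```

Wrapping the list in iter() is what lets the second loop pick up where the first one stopped, which is the behaviour an open file object gives for free.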

Original answer:

In all honesty I think this looks like a task where the heavy lifting should be done by a generator, with some help from regular expressions.

import re
from collections import OrderedDict

def keys_and_values(f):
    # discard any headers
    target = r'^\s*Found to [A-Z] from:\s*$'  # raw string keeps \s intact for the regex engine
    for line in f:
        if re.match(target, line.strip()):
            break

    # yield (key, value) tuples
    key = line.strip()[9]
    for line in f:
        line = line.strip()
        if re.match(target, line):
            key = line[9]
        elif line:
            yield (key, line)

result = OrderedDict()
with open('testfile.txt') as f:
    for k,v in keys_and_values(f):
        result.setdefault(k, []).append(v)

for k in result:
    print('{}\n{}\n'.format(k, result[k]))

Demo:

$ cat testfile.txt 
some
useless
header
lines

Found to A from:

B

C

Found to K from:

B

D

E

Found to A from:

D
$ python parsefile.py
A
['B', 'C', 'D']

K
['B', 'D', 'E']

Upvotes: 2
