Regular expression in Python re.findall()

I want to split a string with re.findall(). I tried the following:

str="<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall('<(abc|ghj)>.*?<*>',str))

The output should be:

['<abc>somechars<*>','<ghj>somechars<*>']

In Notepad, this expression gives the right result, but in Python I get:

['abc', 'ghj']

Any idea? Thanks for the answers.

Upvotes: 1

Answers (3)

senshin

Reputation: 10350

You're capturing (abc|ghj). Use a non-capturing group (?:abc|ghj) instead.

Also, you should escape the second * in your regex since you want a literal asterisk: <\*> rather than <*>.

>>> s = '<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>'
>>> re.findall(r'<(?:abc|ghj)>.*?<\*>', s)
['<abc>somechars<*>', '<ghj>somechars<*>']
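To see why the escape matters, here is a minimal sketch comparing the two forms on a shortened version of the input (the variable names are only for illustration). Unescaped, `<*` means "zero or more `<` characters", so `<*>` ends up matching every bare `>`:

```python
import re

s = '<abc>somechars<*><def>somechars<*>'

# Unescaped: "<*" is "zero or more '<'", followed by a literal ">".
# Since no "<" sits directly before a ">", each match is just ">".
unescaped = re.findall(r'<*>', s)

# Escaped: "<\*>" matches the three literal characters "<*>".
escaped = re.findall(r'<\*>', s)

print(unescaped)  # ['>', '>', '>', '>']
print(escaped)    # ['<*>', '<*>']
```

The unescaped pattern in the question happens to give the right answer on this input only because the lazy `.*?` expands until it reaches the `>` of `<*>`.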

Also also, avoid shadowing the built-in name str.

Upvotes: 1

Nathan

Reputation: 1482

Just make the group a non-capturing group:

import re

s = "<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
print(re.findall(r'<(?:abc|ghj)>.*?<\*>', s))

When the pattern contains a capturing group, re.findall() returns the text of the group(s) instead of the whole match, which is why you only got 'abc' and 'ghj'.

From the Python documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
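The three cases described in that quote can be sketched like this (a small illustration on a shortened input; the variable names are made up for the example):

```python
import re

s = "<abc>somechars<*><def>somechars<*><ghj>somechars<*>"

# No capturing group: each result is the whole match.
no_group = re.findall(r'<(?:abc|ghj)>.*?<\*>', s)

# One capturing group: each result is only that group's text.
one_group = re.findall(r'<(abc|ghj)>.*?<\*>', s)

# Two capturing groups: each result is a tuple with one item per group.
two_groups = re.findall(r'<(abc|ghj)>(.*?)<\*>', s)

print(no_group)    # ['<abc>somechars<*>', '<ghj>somechars<*>']
print(one_group)   # ['abc', 'ghj']
print(two_groups)  # [('abc', 'somechars'), ('ghj', 'somechars')]
```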

Upvotes: 0

vks

Reputation: 67968

(<(?:abc|ghj)>.*?<\*>)

Try this. See the demo:

http://regex101.com/r/kP8uF5/12

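import re

# r'' instead of ur'': the ur prefix is a syntax error in Python 3.
p = re.compile(r'(<(?:abc|ghj)>.*?<\*>)', re.IGNORECASE | re.MULTILINE)
test_str = u"<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"

# The outer parentheses capture the whole match, so findall returns it.
print(re.findall(p, test_str))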
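If you would rather keep the original capturing group, another option is re.finditer(), which yields match objects, so group(0) gives the whole match and group(1) gives the tag. A minimal sketch, with illustrative variable names:

```python
import re

s = "<abc>somechars<*><def>somechars<*><ghj>somechars<*><ijk>somechars<*>"
pattern = re.compile(r'<(abc|ghj)>.*?<\*>')

# group(0) is always the entire match, whatever groups the pattern has.
whole = [m.group(0) for m in pattern.finditer(s)]

# group(1) is the text captured by the first (and only) group.
tags = [m.group(1) for m in pattern.finditer(s)]

print(whole)  # ['<abc>somechars<*>', '<ghj>somechars<*>']
print(tags)   # ['abc', 'ghj']
```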

Upvotes: 3
