supersuraccoon

Reputation: 1691

python regex: multiline and non-greedy

I have some text like this:

cc.Action = {
};

cc.FiniteTimeAction = {

};

cc.Speed = {

};

And the result (list) I want is:

['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

And here's what I have tried:

import codecs
import re

with codecs.open(self.input_file, "r", "utf-8") as f:
    content = f.read()
result = re.findall(r'cc\..*= {.*};', content, re.S)
for r in result:
    print(r)
    print('---------------')

And the result is:

[
'cc.Action = {
};

cc.FiniteTimeAction = {

};

cc.Speed = {

};'
]

Any suggestion will be appreciated, thanks :)

Upvotes: 4

Views: 2819

Answers (5)

slawek

Reputation: 2779

The match seems to begin with cc. and end with ; so we can use the pattern:

'cc\.[^;]+'

Meaning, we match cc. and then match every character which is not ; ([] encloses character class, ^ negates the class).

You could also use the non-greedy repeat *?, but in this case I would say it's overkill. The simpler the regex, the better.

To get the desired output you would also have to get rid of the newlines. Altogether I would propose:

result = re.findall(r'cc\.[^;]+', content.replace('\n', ''))
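A minimal sketch of the negated-class approach, assuming the sample text from the question is held in a string called content (the trailing newline is an assumption about how the file ends). Here the newlines are stripped after matching rather than before; since [^;] already matches newlines, no re.S flag is needed either way.

```python
import re

# Sample text from the question; the trailing newline is an assumption.
content = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};\n"

# [^;]+ consumes everything up to (but not including) the next ';',
# newlines included, so each block is matched separately.
raw_matches = re.findall(r"cc\.[^;]+", content)

# Remove the newlines that were captured inside each block.
result = [m.replace("\n", "") for m in raw_matches]
print(result)
```

Because the pattern stops before the ;, the semicolon never ends up in the result, which matches the list asked for in the question.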

Upvotes: 1

aelor

Reputation: 11124

>>> 'cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};'.replace('\n','').split(";")
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']

This works for your input. Note the trailing empty string in the result; filter it out if you don't want it.

Upvotes: 0

thefourtheye

Reputation: 239683

The problem is that you are using a greedy match. You need a non-greedy match, which you get by appending ? to the quantifier (.*? instead of .*):

import re
# data holds the text from the question
print([i.replace("\n", "") for i in re.findall(r"cc\..*?{.*?}", data, re.DOTALL)])
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

If you don't use .*?, then .*{ matches up to the last { in the string, so all the blocks are returned as a single match. A non-greedy match stops at the first { after the current position.
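To make the difference concrete, here is a small sketch that runs the greedy and the non-greedy pattern side by side on the question's sample text. The variable names are only illustrative, and the braces are escaped purely for readability.

```python
import re

# Sample text from the question.
data = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};"

# Greedy: .* runs to the end of the string and backtracks to the LAST brace.
greedy = re.findall(r"cc\..*\{.*\}", data, re.DOTALL)

# Non-greedy: .*? stops at the FIRST brace that lets the match succeed.
lazy = re.findall(r"cc\..*?\{.*?\}", data, re.DOTALL)

print(len(greedy))  # 1: everything from the first "cc." to the last "}"
print(len(lazy))    # 3: one match per block
```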

Also, this can be done without using a regex, like this:

print([item.replace("\n", "") for item in data.split(";") if item.strip()])
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']

Just split the string on ;, skip the pieces that are empty or contain only whitespace (such as a trailing newline after the last ;), and remove all the \n (newline characters) from the rest.

Upvotes: 0

venpa

Reputation: 4318

If you split based on ;:

codes.split(';')

Output:

['cc.Action = {}', ' cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']

Upvotes: 0

Robin

Reputation: 9644

As your title suggests, the issue is greediness: with re.S, cc\..*= matches from the first cc. all the way to the last = in the string.

You can avoid this behavior by using a lazy quantifier, which stops at the earliest occurrence of whatever follows it:

cc\..*?= {.*?};

Demo here: http://regex101.com/r/oL4yG7.
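A sketch of how the lazy pattern might be applied in Python, assuming the question's sample text is in a string called text. The re.S flag is still required so that . can cross the newlines inside the braces; since this pattern includes the closing ;, the semicolon is stripped afterwards to get the list from the question.

```python
import re

# Sample text from the question.
text = "cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};"

# Lazy quantifiers; re.S lets '.' match the newlines between the braces.
pattern = re.compile(r"cc\..*?= \{.*?\};", re.S)
matches = pattern.findall(text)

# Drop the newlines and the trailing ';' from each match.
cleaned = [m.replace("\n", "").rstrip(";") for m in matches]
print(cleaned)
```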

Upvotes: 0
