Reputation: 1691
I have some text like this:
cc.Action = {
};
cc.FiniteTimeAction = {
};
cc.Speed = {
};
And the result (list) I want is:
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
And here's what I have tried:
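import codecs
import re

# Avoid shadowing the built-in input() by using a different name.
input_file = codecs.open(self.input_file, "r", "utf-8")
content = input_file.read()
result = re.findall(r'cc\..*= {.*};', content, re.S)
for r in result:
    print(r)
    print('---------------')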
And the result is:
[
'cc.Action = {
};
cc.FiniteTimeAction = {
};
cc.Speed = {
};'
]
Any suggestion will be appreciated, thanks :)
Upvotes: 4
Views: 2819
Reputation: 2779
The match seems to begin with cc. and end with ; so we can use the pattern:
'cc\.[^;]+'
That is, we match cc. and then every character that is not ; ([] encloses a character class, and ^ negates the class). You could also use the non-greedy repeat *?, but in this case I would say it is overkill. The simpler the regex, the better.
To get the desired output you also have to get rid of the newlines. Together I would propose:
result = re.findall(r'cc\.[^;]+', content.replace('\n', ''))
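As a minimal sketch of the negated character class approach, the snippet below runs it against sample text reconstructed from the question. The variable names and the exact newline layout of the sample are assumptions.

```python
import re

# Sample input reconstructed from the question (exact whitespace is an assumption).
content = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Drop the newlines first so each definition sits on one line.
flattened = content.replace("\n", "")

# Match "cc." and then everything up to, but not including, the next ";".
result = re.findall(r"cc\.[^;]+", flattened)
print(result)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```

Because [^;]+ can never consume a semicolon, each match stops by itself at the end of one definition, so no lazy quantifier is needed.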
Upvotes: 1
Reputation: 11124
>>> 'cc.Action = {\n};\n\ncc.FiniteTimeAction = {\n\n};\n\ncc.Speed = {\n\n};'.replace('\n','').split(";")
['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']
This will work for you, apart from the trailing empty string, which you can filter out.
Upvotes: 0
Reputation: 239683
The problem is that you are using a greedy search. You need a non-greedy search, which you get by adding the ? operator after the quantifier.
import re
print [i.replace("\n", "") for i in re.findall(r"cc\..*?{.*?}", data, re.DOTALL)]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
If you don't use .*?, then .*{ will match up to the last { in the string, so all the definitions are treated as a single string. With a non-greedy match, it matches only up to the first { after the current character.
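To make the difference concrete, here is a small sketch that counts the matches produced by the greedy and the non-greedy pattern. The sample data is reconstructed from the question, and the variable names are illustrative only.

```python
import re

# Sample input reconstructed from the question (exact whitespace is an assumption).
data = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Greedy: .* runs to the end of the text and backtracks to the LAST "{" and "}".
greedy = re.findall(r"cc\..*{.*}", data, re.DOTALL)

# Non-greedy: .*? stops at the FIRST "{" and "}" it can reach.
lazy = re.findall(r"cc\..*?{.*?}", data, re.DOTALL)

print(len(greedy))  # 1 (one match spanning all three definitions)
print(len(lazy))    # 3 (one match per definition)
```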
This can also be done without a regex, like this:
print [item.replace("\n", "") for item in data.split(";") if item]
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
Just split the string on ; and, for each piece that is not empty, replace all the \n (newline characters) with empty strings.
Upvotes: 0
Reputation: 4318
If you split on the ; character:
codes.split(';')
Output:
['cc.Action = {}', ' cc.FiniteTimeAction = {}', 'cc.Speed = {}', '']
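The stray leading space and the trailing empty string in that output can be cleaned up after the split. The following is a sketch only, assuming the text ends with a newline after the last ; (the sample string and names are not from the original post).

```python
# Sample input reconstructed from the question (exact whitespace is an assumption).
codes = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"

# Split on ";", then remove newlines and surrounding whitespace from each piece.
pieces = [p.replace("\n", "").strip() for p in codes.split(";")]

# Filter AFTER cleaning, so a piece that held only whitespace is dropped too.
result = [p for p in pieces if p]
print(result)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```

Filtering after the cleanup matters here: the last piece is "\n", which is not empty before cleaning but becomes an empty string afterwards.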
Upvotes: 0
Reputation: 9644
As your title suggests, the issue is greediness: cc\..*= matches from the beginning of the string to the last =.
You can avoid this behavior by using a lazy quantifier, which tries to stop at the earliest occurrence of the following character:
cc\..*?= {.*?};
Demo here: http://regex101.com/r/oL4yG7.
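In Python the lazy pattern still needs the re.S (DOTALL) flag, as in the question's own attempt, because . does not match a newline by default. A minimal sketch, assuming sample text reconstructed from the question and illustrative variable names:

```python
import re

# Sample input reconstructed from the question (exact whitespace is an assumption).
text = "cc.Action = {\n};\ncc.FiniteTimeAction = {\n};\ncc.Speed = {\n};\n"
pattern = r"cc\..*?= {.*?};"

# Without re.S, "." cannot cross the newline between "{" and "}", so nothing matches.
without_flag = re.findall(pattern, text)

# With re.S, each definition is matched separately thanks to the lazy quantifiers.
with_flag = re.findall(pattern, text, re.S)

# Strip the newlines and the trailing ";" to get the list asked for in the question.
result = [m.replace("\n", "").rstrip(";") for m in with_flag]
print(result)
# ['cc.Action = {}', 'cc.FiniteTimeAction = {}', 'cc.Speed = {}']
```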
Upvotes: 0