Reputation: 5088
I am still struggling with regexp:
import re
text = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
cde
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
<HELLO>dsfadsf </HELLO>
<NO>itis</NO>
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
'''
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'
print(re.findall(pattern, text, re.S))
This returns:
[('abc', '8')]
I would expect it to return:
[('abc', '4'), ('def', '8')]
Why is it so greedy that it matches everything up to the last closing tag?
This is the regex101 link: https://regex101.com/r/ANO7RA/1
Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(
Upvotes: 0
Views: 71
Reputation: 8932
I agree with the others: it is best to use an XML parser here. But to fix what you have:
You are missing a question mark. Regex quantifiers are greedy by default: they grab as much as they can. To make a quantifier non-greedy, add a question mark directly after it. This regex will give you what you want:
<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>
You had the question mark correctly after </SW-ARRAYSIZE>.* but you were missing it after </SHORT-NAME>.*
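To make the greedy versus non-greedy difference concrete, here is a minimal sketch on a shortened string (the `snippet` value is made up for illustration and is not the XML from the question):

```python
import re

# Hypothetical one-line sample with two <VF> elements.
snippet = "<VF>4</VF><VF>8</VF>"

# Greedy: .* first runs to the end of the string, then backtracks
# only as far as needed, so it stops at the LAST </VF>.
greedy = re.findall(r"<VF>(.*)</VF>", snippet)

# Non-greedy: .*? grows one character at a time and stops
# at the FIRST </VF> that lets the rest of the pattern match.
lazy = re.findall(r"<VF>(.*?)</VF>", snippet)

print(greedy)  # ['4</VF><VF>8']
print(lazy)    # ['4', '8']
```

The same thing happens in the question's pattern: the greedy .* after </SHORT-NAME> runs on to the last <SW-ARRAYSIZE> in the text.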
I think you only want to capture the contents of <SHORT-NAME> and <VF>. If that is the case, I would put them in named groups and retrieve the groups in code to work with them. The regex will then become:
<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>
with the two group names being sn and vf.
Your Python code for retrieving the named groups will then become:
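import re
# regex holds the named-group pattern above; text is the XML string from the question.
# re.S lets . cross newlines; re.finditer visits every match, not just the first.
for match in re.finditer(regex, text, re.S):
    print("shortName: ", match.group('sn'))
    print("vf: ", match.group('vf'))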
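For the XML parser route recommended at the top of this answer, a minimal sketch using the standard library's xml.etree.ElementTree might look like the following. The <ROOT> wrapper is an assumption added here only because the fragment in the question has several top-level elements; a real file probably already has a single root element.

```python
import xml.etree.ElementTree as ET

text = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
cde
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
<HELLO>dsfadsf </HELLO>
<NO>itis</NO>
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
'''

# Assumption: wrap the fragment in a made-up <ROOT> element so it is well-formed XML.
root = ET.fromstring("<ROOT>" + text + "</ROOT>")

# For each <SW-VARIABLE>, read its <SHORT-NAME> and the <VF> inside <SW-ARRAYSIZE>.
pairs = [
    (var.findtext("SHORT-NAME"), var.findtext("SW-ARRAYSIZE/VF"))
    for var in root.iter("SW-VARIABLE")
]
print(pairs)  # [('abc', '4'), ('def', '8')]
```

Because the parser works on elements rather than characters, a lookup can never leak from one <SW-VARIABLE> into the next.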
Upvotes: 1
Reputation: 12758
You can also check this out:
import re
text = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
cde
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
<SW-DATA-DEF-PROPS>
<HELLO>dsfadsf </HELLO>
<NO>itis</NO>
</SW-DATA-DEF-PROPS>
</SW-VARIABLE>
'''
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))
Output:
[('abc', '4'), ('def', '8')]
Upvotes: 1
Reputation: 313
This is the pattern you need (pass re.S so that . also matches newlines):
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'
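Since the question asks about negative lookahead: one common way to use it here is to test every character consumed by the wildcard, so that the wildcard cannot run past a closing </SW-VARIABLE>. The sketch below is an illustration, not the pattern from this answer; the names `sample`, `lazy` and `tempered` are hypothetical, and the sample deliberately contains a first variable without <SW-ARRAYSIZE>.

```python
import re

# Hypothetical sample: the first variable has no <SW-ARRAYSIZE> at all.
sample = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
</SW-VARIABLE>
'''

head = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>'
tail = r'<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>'

# A plain lazy wildcard may still cross into the next element to find a match.
lazy = re.findall(head + r'.*?' + tail, sample, re.S)

# Each consumed character is first checked by (?!</SW-VARIABLE>),
# so the wildcard has to stop before the closing tag.
tempered = re.findall(head + r'(?:(?!</SW-VARIABLE>).)*?' + tail, sample, re.S)

print(lazy)      # [('abc', '8')]
print(tempered)  # [('def', '8')]
```

The lazy version pairs abc with the array size of def, while the lookahead version only reports variables that really contain an <SW-ARRAYSIZE>.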
Upvotes: 2
Reputation: 5088
I seem to have found an answer myself:
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'
print(re.findall(pattern, text))
You really have to limit the use of .* and take advantage of the very predictable structure of the XML.
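Taking that idea one step further, a sketch with no . in the pattern at all could look like this. It assumes every variable has exactly the SHORT-NAME, CATEGORY, SW-ARRAYSIZE order shown in the question; the `strict` and `sizes` names and the shortened sample are made up for illustration.

```python
import re

# Shortened sample, assuming the same element order as in the question.
text = '''
<SW-VARIABLE>
<SHORT-NAME>abc</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>4</VF>
</SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
<SHORT-NAME>def</SHORT-NAME>
<CATEGORY>VALUE</CATEGORY>
<SW-ARRAYSIZE>
<VF>8</VF>
</SW-ARRAYSIZE>
</SW-VARIABLE>
'''

# Only \s* (whitespace between tags) and [^<]* (text inside a tag) are used,
# so nothing in the pattern can run across a tag boundary, and no re.S is needed.
strict = re.compile(
    r'<SW-VARIABLE>\s*'
    r'<SHORT-NAME>([^<]*)</SHORT-NAME>\s*'
    r'<CATEGORY>[^<]*</CATEGORY>\s*'
    r'<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>'
)

# Map each short name to its array size as an integer.
sizes = {name: int(vf) for name, vf in strict.findall(text)}
print(sizes)  # {'abc': 4, 'def': 8}
```

The trade-off is that the pattern silently skips any variable whose elements appear in a different order.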
Upvotes: 0