Reputation: 5088

XML parsing with Python and regex does not return all results

I am still struggling with regexp:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

print(re.findall(pattern, text, re.S))

This returns:

[('abc', '8')]

I would expect it to return:

[('abc', '4'), ('def', '8')]

Why is it so greedy, matching everything up to the last closing tag?

This is the regex101 link: https://regex101.com/r/ANO7RA/1

Maybe negative lookahead will solve this. I was not able to fully grasp the concept, though... :-(

Upvotes: 0

Answers (4)

Barka

Reputation: 8932

I agree with the others: it is best to use an XML parser here. But to fix what you have ...

You are missing a question mark. Regex quantifiers are greedy by default: they grab as much as they can. To make a quantifier non-greedy, add a question mark directly after it. This regex will give you what you want:

<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

You had the question mark correctly after

</SW-ARRAYSIZE>.*

but you were missing it after

</SHORT-NAME>.*
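
To see the difference in isolation, here is a minimal sketch on a made-up string (the `<a>` tags and the variable names are only for illustration, they are not from the question):

```python
import re

# Hypothetical sample: two adjacent elements on one line.
sample = "<a>1</a><a>2</a>"

# Greedy: .* runs to the end and backtracks only as far as the LAST </a>.
greedy = re.findall(r"<a>.*</a>", sample)

# Lazy: .*? stops at the FIRST </a> it can reach, so each element matches separately.
lazy = re.findall(r"<a>.*?</a>", sample)

print(greedy)  # ['<a>1</a><a>2</a>']
print(lazy)    # ['<a>1</a>', '<a>2</a>']
```

The same thing happens in your pattern: the greedy `.*` after `</SHORT-NAME>` swallows the first block and only gives back enough to find the last `<SW-ARRAYSIZE>`.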

I think you only want to capture the contents of SHORT-NAME and VF. If that is the case, I would put them in named groups and retrieve the groups in code to work with them. The regex then becomes:

<SW-VARIABLE>\s*<SHORT-NAME>(?P<sn>[^<]*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>(?P<vf>[^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>

with the two group names being sn and vf.

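Your Python code for retrieving the named groups then becomes the following. Here regex is the pattern above and text is the XML from the question; re.S is needed so that .*? can span line breaks, and re.finditer yields every match instead of only the first:

for match in re.finditer(regex, text, re.S):
    print("shortName:", match.group('sn'))
    print("vf:", match.group('vf'))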
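
For completeness, the XML-parser route mentioned at the top could look roughly like this. It is only a sketch using the standard library's `xml.etree.ElementTree`; the shortened `text` and the dummy `<root>` wrapper are my assumptions, since the snippet in the question has several top-level elements and would not parse as a single document on its own:

```python
import xml.etree.ElementTree as ET

# Shortened version of the XML from the question (assumption: same structure).
text = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <SW-ARRAYSIZE><VF>4</VF></SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <SW-ARRAYSIZE><VF>8</VF></SW-ARRAYSIZE>
</SW-VARIABLE>
'''

# Wrap the fragment in a dummy root so it becomes a well-formed document.
root = ET.fromstring(f"<root>{text}</root>")

# For every SW-VARIABLE, read the name and the VF nested in SW-ARRAYSIZE.
results = [
    (var.findtext("SHORT-NAME"), var.findtext("SW-ARRAYSIZE/VF"))
    for var in root.iter("SW-VARIABLE")
]
print(results)  # [('abc', '4'), ('def', '8')]
```

If your real file already has a single root element, you can skip the wrapper and use `ET.parse` on the file instead.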

Upvotes: 1

Freeman

Reputation: 12758

You can also try this:

import re

text = '''
          <SW-VARIABLE>
            <SHORT-NAME>abc</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>4</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
              cde
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>

          <SW-VARIABLE>
            <SHORT-NAME>def</SHORT-NAME>
            <CATEGORY>VALUE</CATEGORY>
            <SW-ARRAYSIZE>
              <VF>8</VF>
            </SW-ARRAYSIZE>
            <SW-DATA-DEF-PROPS>
                <HELLO>dsfadsf </HELLO>
                <NO>itis</NO>
            </SW-DATA-DEF-PROPS>
          </SW-VARIABLE>
'''
pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?</SW-VARIABLE>'
print(re.findall(pattern, text, re.S))

Output:

[('abc', '4'), ('def', '8')]

Upvotes: 1

jawad-khan

Reputation: 313

This is the pattern you need.

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<].*?)</SHORT-NAME>.*?<SW-ARRAYSIZE>\s*<VF>([^<]*?)</VF>\s*</SW-ARRAYSIZE>.*?<(?:/(?!<SW-VARIABLE>)[^/]*?)SW-VARIABLE>'

Upvotes: 2

mrCarnivore

Reputation: 5088

I seem to have found an answer myself:

pattern = r'<SW-VARIABLE>\s*<SHORT-NAME>([^<]*)</SHORT-NAME>\s*<CATEGORY>[^<]*</CATEGORY>\s*<SW-ARRAYSIZE>\s*<VF>(.*)</VF>\s*</SW-ARRAYSIZE>'

print(re.findall(pattern, text))

You really have to limit the usage of .* and make use of the very predictable structure of the XML.
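
If the elements between SHORT-NAME and SW-ARRAYSIZE are not always the same, one possible way to keep `.` on a leash is to forbid it from running past the end of the current element. This is only a sketch of that idea (sometimes called a tempered dot), written with re.X for readability; the shortened `text` is an assumption standing in for the real data:

```python
import re

# Shortened version of the XML from the question (assumption: same structure).
text = '''
<SW-VARIABLE>
  <SHORT-NAME>abc</SHORT-NAME>
  <CATEGORY>VALUE</CATEGORY>
  <SW-ARRAYSIZE>
    <VF>4</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
<SW-VARIABLE>
  <SHORT-NAME>def</SHORT-NAME>
  <SW-ARRAYSIZE>
    <VF>8</VF>
  </SW-ARRAYSIZE>
</SW-VARIABLE>
'''

pattern = re.compile(r"""
    <SW-VARIABLE>\s*
    <SHORT-NAME>([^<]*)</SHORT-NAME>
    (?:(?!</SW-VARIABLE>).)*?     # skip ahead, but never past the end of this element
    <SW-ARRAYSIZE>\s*<VF>([^<]*)</VF>
""", re.S | re.X)

print(pattern.findall(text))  # [('abc', '4'), ('def', '8')]
```

The negative lookahead `(?!</SW-VARIABLE>)` is checked before every character the dot consumes, so a match can never leak from one SW-VARIABLE block into the next.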

Upvotes: 0
