Regex python encoded xml

Question

I have an html encoded xml payload where I'd like to use regular expressions to extract a certain piece of that xml and write that to a file. I know it's generally not good practice to use regex for xml but I think this a special use case.

Anyway here is a sample encoded xml:

<root>
    <parent>
        <test1>
            <another>
                <subelement>
                    <value>hello</value>
                </subelement>
            </another>
        </test1>
    </parent>
</root>

I ultimately want my result to be:


    
        
            hello

Here is my implementation in python to extract out all the text between the and inclusively:

import html
import re

file_stream = open('/path/to/test.xmp', 'r')
raw_data = file_stream.read()
escaped_raw_data = html.unescape(raw_data)

result = re.search(r"", escaped_raw_data)

However I get no matches for result, what am I doing wrong? How to accomplish my goal?

accdias · Accepted Answer

This works for me:

import html
import re

raw_data = '''
<root>
    <parent>
        <test1>
            <another>
                <subelement>
                    <value>hello</value>
                </subelement>
            </another>
        </test1>
    </parent>
</root>
'''

escaped_raw_data = html.unescape(raw_data)

result = re.search(r'(.*)', escaped_raw_data, re.DOTALL)

if result:
    print(result.group(0))

Regex python encoded xml

Answers (1)

Related Questions