Reputation: 1323
I have an html encoded xml payload where I'd like to use regular expressions to extract a certain piece of that xml and write that to a file. I know it's generally not good practice to use regex for xml but I think this a special use case.
Anyway here is a sample encoded xml:
<root>
<parent>
<test1>
<another>
<subelement>
<value>hello</value>
</subelement>
</another>
</test1>
</parent>
</root>
I ultimately want my result to be:
<test1>
<another>
<subelement>
<value>hello</value>
</subelement>
</another>
</test1>
Here is my implementation in python to extract out all the text between the <test1>
and </test1>
inclusively:
import html
import re
file_stream = open('/path/to/test.xmp', 'r')
raw_data = file_stream.read()
escaped_raw_data = html.unescape(raw_data)
result = re.search(r"<test1[\s\S]*?<\/test1>", escaped_raw_data)
However I get no matches for result, what am I doing wrong? How to accomplish my goal?
Upvotes: 0
Views: 57
Reputation: 5372
This works for me:
import html
import re
raw_data = '''
<root>
<parent>
<test1>
<another>
<subelement>
<value>hello</value>
</subelement>
</another>
</test1>
</parent>
</root>
'''
escaped_raw_data = html.unescape(raw_data)
result = re.search(r'(<test1>.*</test1>)', escaped_raw_data, re.DOTALL)
if result:
print(result.group(0))
Upvotes: 1