I want to parse an XML file and build a dictionary from it, and I need to use regular expressions to do it.
<abc:ROW time_stamp="123">
<name>abcd</name>
<field1>field_value<field1>
<field2>field_value</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
<name>abcd</name>
<field1>field_value<field1>
<field2>field_value</field2>
</abc:ROW>
The expected result is a list of dictionaries with key-value pairs, such as
abcd = [{
"field1" = field_value,
"field2" = field_value
}, {
"field1" = field_value,
"field2" = field_value
}]
Someone please help.
Upvotes: 0
Views: 44
Reputation: 18357
XML is better parsed with a library such as Beautiful Soup than with regex. I can give you a regex solution, but it will surely attract downvotes :)
I suggest reading up on Beautiful Soup and getting familiar with it, so you don't reach for regex whenever you need to parse HTML, XML, or JSON.
Also, your XML is slightly malformed: the / is missing from the closing tag of some of your field elements, which I have fixed.
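If the real input has the same defect in many places, the repair could be scripted rather than done by hand. The following is only a sketch under assumptions that the question does not confirm: it treats any `<tag>text<tag>` sequence (same tag name twice, plain text in between, no attributes) as a missing slash, and the function name `repair_closing_tags` is made up for illustration.

```python
import re

def repair_closing_tags(xml_text):
    # Rewrite <tag>text<tag> as <tag>text</tag>.
    # Assumption: an element never directly contains a child with its own name,
    # so a repeated opening tag after plain text is really a broken closing tag.
    return re.sub(r'<(\w+)>([^<]*)<\1>', r'<\1>\2</\1>', xml_text)

broken = '<field1>field_value<field1>\n<field2>field_value</field2>'
repaired = repair_closing_tags(broken)
print(repaired)
```

Well-formed elements such as `<field2>field_value</field2>` are left untouched, because the pattern only matches when the second tag has no slash.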
You would need something like this Python code to parse the XML and collect the data as a list of dictionaries.
from bs4 import BeautifulSoup
data = '''<abc:ROW time_stamp="123">
<name>abcd</name>
<field1>field_value11</field1>
<field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
<name>abcd</name>
<field1>field_value21</field1>
<field2>field_value22</field2>
</abc:ROW>'''
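abcd = []
# Name the parser explicitly; html.parser lowercases tag names, hence 'abc:row'
soup = BeautifulSoup(data, 'html.parser')
for abc_tag in soup.find_all('abc:row'):
    row = {}  # named row so the built-in dict is not shadowed
    row['field1'] = abc_tag.field1.get_text()
    row['field2'] = abc_tag.field2.get_text()
    abcd.append(row)
print(abcd)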
This prints:
[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
Hope this helps; let me know if you have any questions.
Edit: Solution using pure regex as per OP's special request
There are two <abc:ROW> tags (and there could be more), so you can use this regex
(?s)<abc.*?</abc:ROW>
to match the text of each tag. Then loop over the matches and apply this second regex to each one,
<field1>(.*?)</field1>\s*<field2>(.*?)</field2>
to capture the values of the field1 and field2 tags, store them in a dictionary, and append that dictionary to the abcd list.
Here is the Python code to give you the idea:
import re
data = '''<abc:ROW time_stamp="123">
<name>abcd</name>
<field1>field_value11</field1>
<field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
<name>abcd</name>
<field1>field_value21</field1>
<field2>field_value22</field2>
</abc:ROW>'''
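abcd = []
# Outer regex: one match per <abc:ROW ...>...</abc:ROW> block ((?s) lets . span newlines)
for abc_tag in re.findall(r'(?s)<abc.*?</abc:ROW>', data):
    # Inner regex: capture the text of field1 and field2 inside this block
    match = re.search(r'<field1>(.*?)</field1>\s*<field2>(.*?)</field2>', abc_tag)
    if match is None:
        continue  # skip blocks that lack either field
    row = {}  # named row so the built-in dict is not shadowed
    row['field1'] = match.group(1)
    row['field2'] = match.group(2)
    abcd.append(row)
print(abcd)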
This prints the following output, as you expected:
[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
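The question's expected result is keyed by the value of the <name> tag, and real data may have more than two fields. A possible generalization, still regex-only, is sketched below. It assumes every field tag is named `field` followed by digits and that each row has exactly one <name>; the helper `rows_by_name` and the pattern names are invented for this example.

```python
import re
from collections import defaultdict

data = '''<abc:ROW time_stamp="123">
<name>abcd</name>
<field1>field_value11</field1>
<field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
<name>abcd</name>
<field1>field_value21</field1>
<field2>field_value22</field2>
</abc:ROW>'''

# Body of each <abc:ROW ...> block
ROW_RE = re.compile(r'<abc:ROW\b[^>]*>(.*?)</abc:ROW>', re.DOTALL)
NAME_RE = re.compile(r'<name>(.*?)</name>', re.DOTALL)
# \1 forces the closing tag to match the opening tag name (assumed: field + digits)
FIELD_RE = re.compile(r'<(field\d+)>(.*?)</\1>', re.DOTALL)

def rows_by_name(xml_text):
    grouped = defaultdict(list)
    for row_body in ROW_RE.findall(xml_text):
        name_match = NAME_RE.search(row_body)
        if name_match is None:
            continue  # skip rows without a <name> tag
        # findall returns (tag, value) pairs, which dict() turns into a mapping
        fields = dict(FIELD_RE.findall(row_body))
        grouped[name_match.group(1)].append(fields)
    return dict(grouped)

result = rows_by_name(data)
print(result)
```

For the sample data, `result['abcd']` holds the same two dictionaries as the list printed above. This remains fragile against comments, CDATA sections, and nested tags, which is the usual reason to prefer a parser.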
Upvotes: 0