user10561801

How to parse an XML file in Python

I want to parse an XML file and build a dictionary from it, and I need to use regular expressions to do it.

<abc:ROW time_stamp="123">
    <name>abcd</name>
    <field1>field_value<field1>
    <field2>field_value</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
    <name>abcd</name>
    <field1>field_value<field1>
    <field2>field_value</field2>
</abc:ROW>

The expected result is a list of dictionaries of key-value pairs, such as:

abcd = [{
  "field1" = field_value,
  "field2" = field_value
}, {
  "field1" = field_value,
  "field2" = field_value
}] 

Someone please help.

Upvotes: 0

Views: 44

Answers (1)

Pushpesh Kumar Rajwanshi

Reputation: 18357

XML is better parsed with a library like Beautiful Soup than with regex. I can give you a regex solution for this, but it will surely attract downvotes :)

I suggest going through the link and getting familiar with it, so you don't reach for regex whenever you need to parse HTML, XML, or JSON.

Also, your XML is slightly malformed: the / is missing from some of your closing field tags, which I have fixed.

You would need Python code along these lines to parse the XML and aggregate the data into a list of dictionaries.

from bs4 import BeautifulSoup

data = '''<abc:ROW time_stamp="123">
    <name>abcd</name>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
    <name>abcd</name>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:ROW>'''


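abcd = []
# Name the parser explicitly; html.parser lowercases tag names,
# which is why the lookup below uses 'abc:row'.
soup = BeautifulSoup(data, 'html.parser')

for abc_tag in soup.find_all('abc:row'):
    row = {}  # avoid shadowing the built-in dict
    row['field1'] = abc_tag.field1.get_text()
    row['field2'] = abc_tag.field2.get_text()
    abcd.append(row)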

print(abcd)

This prints:

[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
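
If you would rather avoid a third-party dependency, the standard library's xml.etree.ElementTree can probably do the same job. This is only a minimal sketch: it assumes the snippet can be wrapped in a root element that declares the abc prefix. The root element and the namespace URI urn:example:abc are made up for illustration; a real file should declare its own.

```python
import xml.etree.ElementTree as ET

data = '''<abc:ROW time_stamp="123">
    <name>abcd</name>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
    <name>abcd</name>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:ROW>'''

# Hypothetical namespace URI: the snippet never declares the abc prefix,
# so a wrapper root is added here purely to make it well-formed XML.
NS = "urn:example:abc"
wrapped = f'<root xmlns:abc="{NS}">{data}</root>'
root = ET.fromstring(wrapped)

rows = []
for row in root.findall(f'{{{NS}}}ROW'):
    # Collect every child except <name> as a key/value pair.
    rows.append({child.tag: child.text for child in row if child.tag != 'name'})

print(rows)
```

Unlike the Beautiful Soup version, this picks up any number of field tags without naming each one.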

Hope this helps; let me know if you have any questions.

Edit: Solution using pure regex, as per the OP's special request

There are two <abc:ROW> tags here (and there could be more), so you can use this regex

(?s)<abc.*?</abc:ROW>

to match the text of each tag, then loop over the matches and apply this regex to each one,

<field1>(.*?)</field1>\s*<field2>(.*?)</field2>

to capture the values of the field1 and field2 tags, store them in a dictionary, and append that dictionary to the abcd list.

Here is Python code to give you the idea:

import re

data = '''<abc:ROW time_stamp="123">
    <name>abcd</name>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
    <name>abcd</name>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:ROW>'''


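abcd = []

# (?s) lets . match newlines, so each match spans one whole row block
for abc_tag in re.findall(r'(?s)<abc.*?</abc:ROW>', data):
    match = re.search(r'<field1>(.*?)</field1>\s*<field2>(.*?)</field2>', abc_tag)
    if match is None:
        continue  # skip rows that lack the expected field1/field2 pair
    row = {}  # avoid shadowing the built-in dict
    row['field1'] = match.group(1)
    row['field2'] = match.group(2)
    abcd.append(row)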

print(abcd)

This prints the following output, as you expected:

[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
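
The question's expected result is keyed by the value of the name tag, and hard-coding field1 and field2 gets awkward once there are more fields. Here is a rough sketch of one way to generalize the regex approach. It assumes field tags are always named "field" followed by digits and never nest; by_name and the patterns are my own illustration, not something from the question.

```python
import re
from collections import defaultdict

data = '''<abc:ROW time_stamp="123">
    <name>abcd</name>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:ROW>
<abc:ROW time_stamp="456">
    <name>abcd</name>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:ROW>'''

# Maps each <name> value to the list of field dictionaries found under it.
by_name = defaultdict(list)

for block in re.findall(r'(?s)<abc:ROW\b.*?</abc:ROW>', data):
    name = re.search(r'<name>(.*?)</name>', block)
    # \1 forces the closing tag to match the opening one, so each
    # (tag, value) tuple from findall becomes one dictionary entry.
    fields = dict(re.findall(r'<(field\d+)>(.*?)</\1>', block))
    key = name.group(1) if name else None
    by_name[key].append(fields)

print(dict(by_name))
```

This still breaks on nested tags, CDATA, or attributes on the field tags, which is the usual reason to prefer a real parser.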

Upvotes: 0
