How to parse an XML file in Python

Question

I want to parse an XML file and build a dictionary from it, and I need to use regular expressions to do it.


    abcd
    field_value
    field_value


    abcd
    field_value
    field_value

The expected result is a list of dictionaries with key-value pairs, such as:

abcd = [{
  "field1": field_value,
  "field2": field_value
}, {
  "field1": field_value,
  "field2": field_value
}]

Someone please help.

Pushpesh Kumar Rajwanshi · Accepted Answer

XML is better parsed with a library such as Beautiful Soup than with regex. I can give you a perfect regex solution to this, but I will surely attract downvotes for it :)

I suggest you go through the link and get familiar with it, so that you don't reach for regex whenever you need to parse HTML, XML, or JSON.

Also, your XML is slightly malformed: the / is missing from some of your closing field tags, which I have fixed.

You need Python code along these lines to parse the XML and aggregate the data into a list of dictionaries.
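One way to catch this kind of problem early is to run the text through a strict XML parser and see whether it raises. The sketch below uses the standard library's xml.etree.ElementTree; the helper name is_well_formed and both sample snippets are made up for illustration and are not the asker's data.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippets: the second one lacks the "/" in its closing field tag.
good = "<row><field1>value</field1></row>"
bad = "<row><field1>value<field1></row>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Lenient parsers such as Beautiful Soup with an HTML parser will often accept malformed input silently, so a strict check like this can be a useful first step.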

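from bs4 import BeautifulSoup

# Sample input: each <abc:row> holds one field1 and one field2 element.
data = '''
<abc:row>
    abcd
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:row>
<abc:row>
    abcd
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:row>
'''

abcd = []
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')

# Build one dictionary per <abc:row> element.
for abc_tag in soup.find_all('abc:row'):
    row = {}
    row['field1'] = abc_tag.field1.get_text()
    row['field2'] = abc_tag.field2.get_text()
    abcd.append(row)

print(abcd)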

This prints:

[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
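If installing Beautiful Soup is not an option, the standard library's xml.etree.ElementTree can do a similar job. This is a sketch of a different technique from the one above, and it rests on two assumptions that are not in the question: the rows sit inside a wrapper element (called root here), and that wrapper declares the abc prefix with a namespace URI (the URI below is a placeholder). A strict XML parser rejects an undeclared prefix, so some declaration is required.

```python
import xml.etree.ElementTree as ET

# Assumed input: a <root> wrapper that declares the "abc" prefix.
# The namespace URI is a placeholder chosen for this example.
data = '''
<root xmlns:abc="http://example.com/abc">
  <abc:row>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
  </abc:row>
  <abc:row>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
  </abc:row>
</root>
'''

# Map the prefix used in the search path to the declared namespace URI.
ns = {"abc": "http://example.com/abc"}
root = ET.fromstring(data.strip())

# One dictionary per row: child tag name -> child text.
rows = [
    {child.tag: child.text for child in row}
    for row in root.findall("abc:row", ns)
]

print(rows)
```

Because the dictionary is built from whatever child elements each row contains, this version does not need to know the field names in advance.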

Hope this helps; let me know if you have any questions.

Edit: Solution using pure regex as per OP's special request

There are two <abc:row> tags (and there can be more), so you can use this regex

(?s)<abc:row>.*?</abc:row>

to match the text of each row tag. Then iterate over the matches with a for loop and apply this regex to each one,

<field1>(.*?)</field1>\s*<field2>(.*?)</field2>

to capture the values of the field1 and field2 tags, store them in a dictionary, and append the dictionary to the abcd list.

Here is Python code to illustrate the idea:

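import re

# Sample input: each <abc:row> holds one field1 and one field2 element.
data = '''
<abc:row>
    abcd
    <field1>field_value11</field1>
    <field2>field_value12</field2>
</abc:row>
<abc:row>
    abcd
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:row>
'''

abcd = []

# (?s) lets "." match newlines, so each match spans one whole row.
for row_text in re.findall(r'(?s)<abc:row>.*?</abc:row>', data):
    row = {}
    # Capture the two field values inside this row.
    match = re.search(r'<field1>(.*?)</field1>\s*<field2>(.*?)</field2>', row_text)
    row['field1'] = match.group(1)
    row['field2'] = match.group(2)
    abcd.append(row)

print(abcd)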


This prints the following output, as you expected:

[{'field1': 'field_value11', 'field2': 'field_value12'}, {'field1': 'field_value21', 'field2': 'field_value22'}]
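The two-step regex above hard-codes field1 and field2. If the rows can carry a varying number of fields, one possible variation is to capture the tag name as well and use a backreference for the closing tag. This is only a sketch: the field3 element in the sample is hypothetical, and it assumes every field tag is named "field" followed by digits and that tags are never nested.

```python
import re

# Sample input; the field3 element is hypothetical, added to show a row
# with a different number of fields.
data = '''
<abc:row>
    <field1>field_value11</field1>
    <field2>field_value12</field2>
    <field3>field_value13</field3>
</abc:row>
<abc:row>
    <field1>field_value21</field1>
    <field2>field_value22</field2>
</abc:row>
'''

# Capture the body of each row; DOTALL lets "." span newlines.
ROW_RE = re.compile(r'<abc:row>(.*?)</abc:row>', re.DOTALL)
# Capture (tag name, value); \1 requires the closing tag to match the opening one.
FIELD_RE = re.compile(r'<(field\d+)>(.*?)</\1>', re.DOTALL)

# findall returns (name, value) tuples, which dict() turns into one row dictionary.
abcd = [dict(FIELD_RE.findall(row_body)) for row_body in ROW_RE.findall(data)]

print(abcd)
```

This still shares the usual weaknesses of regex on XML (attributes, comments, CDATA, and nesting will break it), so it is best kept to small, predictable inputs.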
