Omair .

Reputation: 344

Extract XML tag names, attributes and values

I am looking for a way to extract all the information available in an XML file to a flat file or a database.

For example, given:

<r>
    <P>
        <color val="1F497D"/>
    </P>
    <t val="123" val2="234">TEST REPORT</t>
</r>

I would like the output to be:

r
P
color,val,1F497D
t,val,123
t,val2,234

Any pointers on how to go about this in Python?

Upvotes: 0

Views: 145

Answers (3)

jdotjdot

Reputation: 17052

I'm not really sure why you'd want this, but you should take a look at lxml or BeautifulSoup for Python.

Alternatively, if you just want it in exactly the form you've presented above:

import re

def parse_html(html_string):
    # Grab the contents of every opening or self-closing tag; closing tags are skipped.
    fields = re.findall(r'(?<=\<)[\w=\s\"\']+?(?=\/?\>)', html_string)
    out = []
    for field in fields:
        # Greedy \w+ captures the whole tag name (a lazy \w+? stops after one character).
        tag = re.match(r'(?P<tag>\w+)', field).group('tag')
        # Each attribute is a (name, value) pair.
        attrs = re.findall(r' (\w+?)\=[\"\'](.+?)[\"\']', field)
        if attrs:
            for x in attrs:
                out.append(','.join([tag] + list(x)))
        else:
            out.append(tag)

    print('\n'.join(out))

This is a bit of a hack, which is why you should generally use lxml or BeautifulSoup, but it gets this particular job done.

Output of the program above:

r
P
color,val,1F497D
t,val,123
t,val2,234

Upvotes: 0

Burhan Khalid

Reputation: 174624

Install lxml, then:

>>> from lxml import etree
>>> # s is the XML string from the question
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> parsed_xml = etree.XML(s, parser)
>>> for i in parsed_xml.iter('*'):
...     print(i.tag)
...     for x in i.items():
...         print('%s,%s' % (x[0], x[1]))
...
r
P
color
val,1F497D
t
val,123
val2,234

I'll leave it up to you to format the output.
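One way to do that formatting, as a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml (iter() and items() behave the same way for this input), so it runs without installing anything. The name flatten_xml, and the choice to emit a bare tag name for elements without attributes, are assumptions based on the output the question asks for.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample document from the question.
XML_DOC = """<r>
    <P>
        <color val="1F497D"/>
    </P>
    <t val="123" val2="234">TEST REPORT</t>
</r>"""


def flatten_xml(xml_string):
    """Return one row per attribute, or a one-item row for a tag without attributes."""
    rows = []
    root = ET.fromstring(xml_string)
    # iter() walks the root and all of its descendants in document order.
    for element in root.iter():
        attributes = element.items()
        if attributes:
            for name, value in attributes:
                rows.append([element.tag, name, value])
        else:
            rows.append([element.tag])
    return rows


rows = flatten_xml(XML_DOC)

# Write the rows as comma-separated lines. Swap the StringIO buffer for
# open("out.csv", "w", newline="") to produce an actual flat file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue(), end="")
```

Element text such as TEST REPORT is dropped here because the requested output does not include it; element.text would give you that value if you need it.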

Upvotes: 1

Anuj Gupta

Reputation: 10526

I think your best bet is to use BeautifulSoup.

e.g. (from their docs):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc: the sample HTML string from their docs
soup.title
# <title>The Dormouse's story</title>
soup.p['class']
# ['title']
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

You could also take a look at lxml; it's easy to use and efficient, and BeautifulSoup can use it as its underlying parser. Specifically, you might want to take a look at this page.
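Applied to the XML from the question, a BeautifulSoup version might look like the sketch below. It assumes bs4 is installed and uses Python's built-in html.parser, which lowercases tag names, so <P> comes back as p. Passing "xml" as the parser instead (which requires lxml) should preserve the original case.

```python
from bs4 import BeautifulSoup

# Sample document from the question.
XML_DOC = """<r>
    <P>
        <color val="1F497D"/>
    </P>
    <t val="123" val2="234">TEST REPORT</t>
</r>"""

# html.parser ships with Python, so no extra parser library is needed.
soup = BeautifulSoup(XML_DOC, "html.parser")

rows = []
# find_all(True) matches every tag, in document order.
for tag in soup.find_all(True):
    if tag.attrs:
        # One line per attribute: tag,name,value
        for name, value in tag.attrs.items():
            rows.append(",".join([tag.name, name, value]))
    else:
        # Tags without attributes are listed by name only.
        rows.append(tag.name)

print("\n".join(rows))
```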

Upvotes: 0
