theteddyboy

Reputation: 386

How to get value from XML Tag in Python?

I have an XML file like the one below.

<?xml version="1.0" encoding="UTF-8"?><searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
    <snippet>
      this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
    <snippet>
            this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   ........ #and other documents .....

  <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
  <group id="0" size="298" score="55">
      <title>
         <phrase>GROUP B</phrase>
      </title>
      <document refid="2"/>
      <document refid="13"/>
      <document refid="3"/>
   </group>
</searching>

I want to get the name of each group above, along with the IDs (and titles) of the documents in each group. My idea is to store the document IDs and titles in a dictionary, like this:

import codecs
documentID = {}    
group = {}

myfile = codecs.open("file.xml", mode = 'r', encoding = "utf8")
for line in myfile:
    line = line.strip()
    #get id from tags
    #get title from tag
    #store in documentID 


    #get group name and document reference

I have also tried BeautifulSoup, but I am very new to it and don't know how to proceed. This is the code I have so far:

def outputCluster(rFile):
    documentInReadFile = {}         #dictionary to store all document in readFile

    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in readFile:
    # print soup.prettify()

    # print soup.find_all('title')

outputCluster("file.xml")

Any suggestions would be appreciated. Thank you.

Upvotes: 4

Views: 36475

Answers (4)

PhilDenfer

Reputation: 270

BeautifulSoup is nice to use, though a bit surprising at first.

soup = BeautifulSoup(myfile)

soup now holds the whole file, and you have to search through it to find the part you need, for instance:

group = soup.find(name="group", attrs={'id': '0', 'size': '298'})

group now contains the group tag and its contents (the first matching group it found):

<group>blabla its contents<tag inside it>blabla</tag inside it>etc.</group>

Repeat this to work your way down to the innermost tags. The more specific each search is, the lower the chance of landing on the wrong tag. Then

lastthingyoufound.find(name='phrase')

will contain your answer. The result still includes the surrounding tags, so you need another call to extract the text; which one depends on the BeautifulSoup version. Use find_all (findAll in BeautifulSoup 3) to get lists you can iterate over when there are multiple elements. Keep intermediate tags in their own variables so you can pull other information from them later. Reassigning soup = soup.find(...) at each step means you are only looking for one specific thing and lose the tags in between, which is the same as writing one long chain such as soup = soup.find(...).find(...).find_all(...)[-1].find(...)['id'].

Upvotes: 0

TheSoundDefense

Reputation: 6935

The previous posters have the right of it. The etree documentation can be found here:

https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

Here's a code sample that might do the trick (partially taken from the above link):

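import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
root = tree.getroot()

for group in root.findall('group'):
    # The group name is nested as <title><phrase>...</phrase></title>.
    title = group.find('title')
    titlephrase = title.find('phrase').text
    for doc in group.findall('document'):
        # Each <document refid="..."/> refers to a top-level document id.
        refid = doc.get('refid')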

Or if you want the ID stored in the group tag, you'd use id = group.get('id') instead of searching for all the refids.
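To connect this back to the dictionary idea in the question, here is a minimal sketch rather than a definitive solution. It assumes every group name sits in title/phrase and that each refid points at the id of a top-level document. The names SAMPLE_XML and group_documents are made up for illustration, and the XML is a trimmed copy of the question's file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (hypothetical stand-in for file.xml).
SAMPLE_XML = """<searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
      <url>http://www.google.com/</url>
   </document>
   <group id="0" size="298" score="145">
      <title><phrase>GROUP A</phrase></title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
   <group id="0" size="298" score="55">
      <title><phrase>GROUP B</phrase></title>
      <document refid="2"/>
   </group>
</searching>"""


def group_documents(root):
    """Return {group name: [(refid, title or None), ...]} (illustrative helper)."""
    # findall('document') on the root only matches its direct children,
    # so the <document refid=.../> entries inside groups are not included here.
    titles = {
        doc.get("id"): doc.findtext("title", default="").strip()
        for doc in root.findall("document")
    }
    groups = {}
    for group in root.findall("group"):
        name = group.findtext("title/phrase", default="").strip()
        groups[name] = [
            (ref.get("refid"), titles.get(ref.get("refid")))
            for ref in group.findall("document")
        ]
    return groups


root = ET.fromstring(SAMPLE_XML)
groups = group_documents(root)
for name, docs in groups.items():
    print(name, docs)
```

A refid with no matching top-level document (84 in this trimmed sample) comes back with a title of None, so you can decide how to handle missing entries. For a real file, swap ET.fromstring(SAMPLE_XML) for ET.parse('your_file.xml').getroot().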

Upvotes: 3

Maximas

Reputation: 712

ElementTree is brilliant for looking through XML. The docs show how to manipulate XML in many ways, including how to get the contents of a tag. An example from the docs is:
XML:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Code:

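>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('country_data.xml').getroot()  # the XML above, saved to a file
>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
...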
Liechtenstein 1
Singapore 4
Panama 68

Which you could manipulate easily enough to do what you want.
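As a rough illustration of that adaptation, the same find/get pattern can fill the documentID dictionary from the question. This is only a sketch: SAMPLE and document_titles are hypothetical names, and the XML is a shortened copy of the question's file that assumes every document has a title child.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML (hypothetical stand-in for file.xml).
SAMPLE = """<searching>
   <document id="0">
      <title>lord of the rings.</title>
      <snippet>
      this is a snippet of a document.
      </snippet>
   </document>
   <document id="1">
      <title>harry potter.</title>
   </document>
</searching>"""

root = ET.fromstring(SAMPLE)

document_titles = {}
for document in root.findall("document"):
    doc_id = document.get("id")            # attribute value, like country.get('name')
    title = document.find("title").text    # element text, like country.find('rank').text
    document_titles[doc_id] = title

print(document_titles)
```

Attribute values are returned as strings, so the keys here are "0" and "1" rather than integers; convert with int() if you need numeric ids.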

Upvotes: 2

he1ix

Reputation: 380

Did you have a look at Python's XML etree parser? There are plenty of examples on the web.

Upvotes: 2
