alvas

Reputation: 121992

Why am I not getting the text in the XML tag? - python elementtree

How do I read all the text within the <context>...</context> tag? And how about the <head>...</head> tag inside the <context> tag?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

But when I run my code to read the text within the <context>...</context> tag, I only get the text up to the <head> tag.

import xml.etree.ElementTree as et    
inputfile = "./coach.data"    
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j = 1
for i in instances:
    print("instance " + str(j))  # str() is needed: a str and an int cannot be concatenated
    print("left: " + i)
    print("\n")
    j += 1

Now I'm just getting the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a 

The output should also include the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

Upvotes: 1

Views: 1199

Answers (2)

Lior

Reputation: 2631

First of all, you have a mistake in your code. The loop for corpus in root is not necessary, because your root is already the corpus element.

What you probably meant to do was:

contexts = []  # collects the text before <head> in each <context>
for lexelt in root:
  for instance in lexelt:
    for context in instance:
      contexts.append(context.text)

Now, regarding your question - inside the for context in instance block, you can access the other two strings you need:

  1. The head text can be accessed via context.find('head').text
  2. The text to the right of your head element can be read via context.find('head').tail

According to the Python ElementTree docs:

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
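Putting the two points above together, here is a minimal sketch of how the left, head and right parts could be extracted. It is written for Python 3 and parses the sample XML from a string so that it is self-contained. The helper name split_context and the variable results are made up for this illustration, and the sketch assumes each context holds at most one head element.

```python
import xml.etree.ElementTree as et

# Sample data copied from the question, inlined so the sketch runs on its own.
XML = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""


def split_context(context):
    """Return (left, head, right) for a <context> element (hypothetical helper)."""
    head = context.find('head')
    left = context.text or ''          # text before the <head> tag
    if head is None:                   # no <head> child: everything is "left"
        return left, '', ''
    # head.text is the text inside <head>; head.tail is the text after </head>
    return left, head.text or '', head.tail or ''


root = et.fromstring(XML)              # root is the <corpus> element
results = [split_context(ctx) for ctx in root.iter('context')]

for number, (left, head, right) in enumerate(results, 1):
    print("instance %d" % number)
    print("left: " + left)
    print("head: " + head)
    print("right: " + right)
    print("")
```

Using root.iter('context') avoids the nested loops entirely, at the cost of no longer knowing which lexelt each context belongs to.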

Upvotes: 3

swang

Reputation: 219

Within ElementTree you have to consider the tail property of child nodes. Also, corpus is the root in your case.


    import xml.etree.ElementTree as et

    inputfile = "./coach.data"
    corpus = et.parse(inputfile).getroot()  # the root element is <corpus>

    def getalltext(elem):
        # text and tail are None when empty, so fall back to ''
        return (elem.text or '') + ''.join(
            getalltext(child) + (child.tail or '') for child in elem)

    instances = []
    for lexelt in corpus:
        for instance in lexelt:
            instances.append(getalltext(instance))

    j = 1
    for i in instances:
        print("instance " + str(j))
        print("text: " + i)  # the full text of the instance, not only the left side
        print("\n")
        j += 1
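As an alternative that is not part of the answer above, the standard library's Element.itertext() walks the text and tail strings in document order, so the recursion can be skipped. A small sketch for Python 3, using one instance from the question inlined as a string; the names all_text and full_sentence are only illustrative.

```python
import xml.etree.ElementTree as et

# One <instance> from the question, inlined so the sketch runs on its own.
XML = """<instance id="1">
    <context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>"""


def all_text(elem):
    # itertext() yields elem.text, then each descendant's text and tail,
    # in document order; the element's own tail is not included.
    return ''.join(elem.itertext())


instance = et.fromstring(XML)
context = instance.find('context')
full_sentence = all_text(context)
print(full_sentence)
```

Calling all_text on the instance element instead of the context also works, but the result then carries the indentation whitespace around the context tag, so it would need a strip().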

Upvotes: 1
