Reputation: 121992
How do I read all the text within the <context>...</context> tag? And how about the <head>...</head> tag within the <context> tag?
I've an XML file that looks like this:
<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
<instance id="2">
<context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
</instance>
</lexelt>
</corpus>
But when I run my code to read the text within <context>...</context>, I only get the text up to the <head> tag.
import xml.etree.ElementTree as et

inputfile = "./coach.data"
root = et.parse(open(inputfile)).getroot()
instances = []
for corpus in root:
    for lexelt in corpus:
        for instance in lexelt:
            instances.append(instance.text)

j = 1
for i in instances:
    print("instance " + str(j))
    print("left: " + i)
    print("\n")
    j += 1
Now I'm just getting the left side:
instance 1
left: I'll buy a train or
instance 2
left: A branch line train took us to Aubagne where a
The output also needs the head and the right side of the context. It should be:
instance 1
left: I'll buy a train or
head: coach
right: ticket.
instance 2
left: A branch line train took us to Aubagne where a
head: coach
right: picked us up for the journey up to the camp.
Upvotes: 1
Views: 1199
Reputation: 2631
First of all, you have a mistake in your code. The for corpus in root loop is not necessary; your root is already corpus.
What you probably meant to do was:
contexts = []
for lexelt in root:
    for instance in lexelt:
        for context in instance:
            contexts.append(context.text)
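A quick way to check the point about the root element is to print its tag. The sketch below inlines a trimmed copy of the question's XML so that no file is needed. It also uses Element.iter(), which is an alternative to the nested loops and not something the answer itself relies on.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's data, inlined so the sketch runs on its own.
XML = """<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
</lexelt>
</corpus>"""

root = ET.fromstring(XML)

# The root element is <corpus> itself, so iterating over it yields <lexelt>.
print(root.tag)                      # corpus
child_tags = [child.tag for child in root]
print(child_tags)                    # ['lexelt']

# iter() walks all descendants with the given tag, whatever the nesting depth.
contexts = [context.text for context in root.iter("context")]
print(contexts)
```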
Now, regarding your question: inside the for context in instance block, you can access the other two strings you need:
The text inside the <head> element can be accessed via context.find('head').text
The text to the right of the <head> element can be read via context.find('head').tail
According to the Python etree docs:
The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
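Putting text and tail together, a minimal sketch of the left/head/right split might look like the following. It assumes each <context> contains exactly one <head>, as in the question's sample. The XML is inlined instead of being read from ./coach.data, and split_context is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question, inlined so the sketch is self-contained.
XML = """<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
<instance id="2">
<context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
</instance>
</lexelt>
</corpus>"""


def split_context(context):
    """Return (left, head, right) for one <context> element (hypothetical helper)."""
    head = context.find("head")
    left = (context.text or "").strip()   # text before <head>
    word = (head.text or "").strip()      # text inside <head>
    right = (head.tail or "").strip()     # text after </head>
    return left, word, right


root = ET.fromstring(XML)  # root is the <corpus> element
results = []
for instance in root.iter("instance"):
    context = instance.find("context")
    results.append((instance.get("id"),) + split_context(context))

for inst_id, left, word, right in results:
    print("instance " + inst_id)
    print("left: " + left)
    print("head: " + word)
    print("right: " + right)
```

The `or ""` guards are there because text and tail are None when an element has no text in that position.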
Upvotes: 3
Reputation: 219
Within ElementTree you will have to consider the tail property of child nodes. Also corpus IS root in your case.
import xml.etree.ElementTree as et

inputfile = "./coach.data"
corpus = et.parse(open(inputfile)).getroot()

def getalltext(elem):
    # text and tail can be None, so fall back to an empty string
    return (elem.text or '') + ''.join(
        [getalltext(child) + (child.tail or '') for child in elem])

instances = []
for lexelt in corpus:
    for instance in lexelt:
        instances.append(getalltext(instance))

j = 1
for i in instances:
    print("instance " + str(j))
    print("left: " + i)
    print("\n")
    j += 1
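If only the concatenated text is needed, the standard library's Element.itertext() does much the same job as the recursive helper. It is shown here as an alternative, not as what this answer uses, with one <context> from the question inlined for illustration.

```python
import xml.etree.ElementTree as ET

# One <context> from the question, inlined for a self-contained sketch.
context = ET.fromstring(
    "<context>I'll buy a train or <head>coach</head> ticket.</context>"
)

# itertext() yields the element's own text, then each descendant's text
# and tail, in document order.
full_text = "".join(context.itertext())
print(full_text)  # I'll buy a train or coach ticket.
```

Unlike the left/head/right split, this loses the position of the <head> word, so it only fits when the whole sentence is wanted.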
Upvotes: 1