Reputation: 4077
I have this text:
INTRODUCTION
This is a test document for xml.
I need to extract this sentence.
Conclusion
It should hopefully..
The line "I need to extract this sentence." is in italics. The XML of the file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00470EEF" w:rsidRDefault="00456755">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00456755">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>INTRODUCTION</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00456755" w:rsidRPr="00B042E3" w:rsidRDefault="00456755">
<w:pPr>
<w:rPr>
<w:color w:val="FFFF00"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00B042E3">
<w:rPr>
<w:color w:val="FFFF00"/>
</w:rPr>
<w:t>This is a test document for xml.</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00456755" w:rsidRDefault="00E971E1">
<w:r>
<w:rPr>
<w:i/>
</w:rPr>
<w:t>I need to extract this sentence.</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:p w:rsidR="00456755" w:rsidRDefault="00456755"/>
<w:p w:rsidR="00456755" w:rsidRDefault="00456755">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00456755">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>Conclusion</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755">
<w:r w:rsidRPr="00456755">
<w:t>It should hopefully</w:t>
</w:r>
<w:r>
<w:t>..</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidR="00456755" w:rsidRPr="00456755">
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
I tried:
tree = ET.parse(doc_xml)
[b.tag for b in tree.iterfind(".//i")]
The above returns an empty list.
I've searched a lot but wasn't able to figure out how to do that, since the italic text is only flagged by an empty <w:i/> element in the run properties rather than wrapped in it. I have seen this question where this was done easily using BeautifulSoup.
Edit: This isn't exactly related, but here is an ElementTree approach that extracts all text:
w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
for p in source.findall('.//{' + w + '}p'):
    print(''.join(t.text for t in p.findall('.//{' + w + '}t')))
Upvotes: 1
Views: 697
Reputation: 369134
Slightly modifying your code, you will get what you want:
>>> w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
>>> for t in tree.findall('.//{%(ns)s}i/../..//{%(ns)s}t' % {'ns': w}):
...     print(t.text)
...
I need to extract this sentence.
BTW, if you use local-name(), you don't need to specify the namespace (you need to use the xpath method, which is available in lxml):
>>> for t in tree.xpath('.//*[local-name()="i"]/../..//*[local-name()="t"]'):
...     print(t.text)
...
I need to extract this sentence.
UPDATE
.. in the expression selects the parent node of the current node. So {...}i/../.. will select the grandparent node of the i node.
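As a self-contained sketch of the parent-step idea, the snippet below runs the same kind of path with the standard library's ElementTree against a trimmed-down stand-in for the question's document. The `SAMPLE` string and the `italic_texts` helper are assumptions made for illustration, and a prefix map is passed to `findall` instead of spelling out the `{namespace}` form.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical, trimmed-down version of the document in the question:
# one bold paragraph and one italic paragraph.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:rPr><w:i/></w:rPr>'
    '<w:t>I need to extract this sentence.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W
)


def italic_texts(root):
    """Collect the text of w:t elements that sit under the grandparent of a w:i."""
    ns = {'w': W}  # prefix map, so the path can use w: instead of {uri}
    # w:i  ->  ..  (w:rPr)  ->  ..  (w:r)  ->  any w:t below that run
    return [t.text for t in root.findall('.//w:i/../..//w:t', ns)]


root = ET.fromstring(SAMPLE)
print(italic_texts(root))
```

Note that this only checks for the presence of a `w:i` element. If `w:i` appears in a paragraph-level `w:pPr/w:rPr`, the grandparent is the whole paragraph, so every `w:t` in it would be returned.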
Upvotes: 2
Reputation: 89295
Building my answer on the code in your Edit section:
w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
for p in source.findall('.//{' + w + '}p[.//{' + w + '}i]'):
    print(''.join(t.text for t in p.findall('.//{' + w + '}t')))
Basically, the first XPath is supposed to match all <w:p> elements that have a descendant <w:i> node; then, as you know, the next line extracts the text of all <w:t> nodes from the matched <w:p> nodes.
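One caveat: the XPath subset documented for the standard library's ElementTree only lists `[tag]` predicates for direct children, so a descendant predicate such as `[.//...i]` may be rejected as an invalid predicate, depending on the library in use. A minimal sketch of the same idea that does the descendant check in Python instead is shown below. The `SAMPLE` document and the `italic_paragraphs` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical, trimmed-down version of the document in the question:
# one italic paragraph and one plain paragraph split across two runs.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:rPr><w:i/></w:rPr>'
    '<w:t>I need to extract this sentence.</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>It should hopefully</w:t></w:r>'
    '<w:r><w:t>..</w:t></w:r></w:p>'
    '</w:body></w:document>' % W
)


def italic_paragraphs(root):
    """Return the joined text of every w:p that has a w:i descendant."""
    p_tag = '{%s}p' % W
    i_tag = '{%s}i' % W
    t_tag = '{%s}t' % W
    results = []
    for p in root.iter(p_tag):
        # Descendant test done in Python rather than in an XPath predicate.
        if p.find('.//' + i_tag) is not None:
            # "t.text or ''" guards against empty <w:t/> elements.
            results.append(''.join(t.text or '' for t in p.iter(t_tag)))
    return results


root = ET.fromstring(SAMPLE)
print(italic_paragraphs(root))
```

As with the original snippet, this returns the whole paragraph's text as soon as any run in it is italic, so a paragraph that mixes italic and plain runs comes back in full.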
Upvotes: 2