Reputation: 9
I have a newspaper in XML format and I am trying to extract specific parts of it.
My XML looks like the following:
<?xml version="1.0" encoding="UTF-8"?>
<articles>
<text>
<text.cr>
<pg pgref="1" clipref="1" pos="0,0,2275,3149"/>
<p type="none">
<wd pos="0,0,0,0"/>
</p>
</text.cr>
<text.cr>
<pg pgref="1" clipref="2" pos="0,0,2275,3149"/>
<p type="none">
<wd pos="0,0,0,0"/>
</p>
</text.cr>
<text.cr>
<pg pgref="1" clipref="3" pos="4,32,1078,454"/>
<p type="none">
<wd pos="4,32,1078,324">The</wd>
<wd pos="12,234,1078,450">Newspaper</wd>
</p>
</text.cr>
I want to extract the text of the wd elements, such as "The" and "Newspaper".
I used xml.etree.ElementTree, and my code looks like this:
import xml.etree.ElementTree as ET
for each_file in entries:
    mytree = ET.parse(path.xml)
    tree = mytree.findall('text')
    for x in tree:
        x_ = x.findall('wd')
I managed to parse the root and also the attributes, but I don't know how to reach the 'wd' elements.
Thanks for the help
Upvotes: 0
Views: 71
Reputation: 23815
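See below. The './/wd' path finds wd elements at any depth, and the code prints the pos attribute of those whose text is in values: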
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<articles>
<text>
<text.cr>
<pg pgref="1" clipref="1" pos="0,0,2275,3149"/>
<p type="none">
<wd pos="0,0,0,0"/>
</p>
</text.cr>
<text.cr>
<pg pgref="1" clipref="2" pos="0,0,2275,3149"/>
<p type="none">
<wd pos="0,0,0,0"/>
</p>
</text.cr>
<text.cr>
<pg pgref="1" clipref="3" pos="4,32,1078,454"/>
<p type="none">
<wd pos="4,32,1078,324">The</wd>
<wd pos="12,234,1078,450">Newspaper</wd>
</p>
</text.cr></text></articles>'''
values = ['The', 'Newspaper']
root = ET.fromstring(xml)
wds = [wd for wd in root.findall('.//wd') if wd.text in values]
for wd in wds:
    print(wd.attrib['pos'])
Output:
4,32,1078,324
12,234,1078,450
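If the goal is the words themselves rather than a fixed list of values, a small variation might collect the text and position of every non-empty wd element. This is only a sketch, assuming the same document structure as above; the names `sample` and `words` are illustrative and not from the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same nesting as the newspaper file (illustrative)
sample = '''<articles><text><text.cr>
<pg pgref="1" clipref="3" pos="4,32,1078,454"/>
<p type="none">
<wd pos="0,0,0,0"/>
<wd pos="4,32,1078,324">The</wd>
<wd pos="12,234,1078,450">Newspaper</wd>
</p></text.cr></text></articles>'''

root = ET.fromstring(sample)

# iter('wd') walks the whole tree; empty <wd/> elements have text None and are skipped
words = [(wd.text, wd.get('pos')) for wd in root.iter('wd') if wd.text]
print(words)
```

Filtering on `wd.text` avoids having to know the words in advance.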
Upvotes: 0
Reputation: 24930
Change your loop to:
for x in tree:
    x_ = x.findall('.//wd')
    for t in x_:
        if t.text is not None:
            print(t.text)
Output:
The
Newspaper
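The reason the original loop found nothing is that a bare tag name in findall matches only direct children, while wd sits two levels below text (inside text.cr and p). A minimal sketch of the difference, using a shortened sample; `sample`, `direct` and `nested` are illustrative names:

```python
import xml.etree.ElementTree as ET

# Shortened sample: <wd> is nested inside <text.cr> and <p>, not directly under <text>
sample = ('<text><text.cr><p type="none">'
          '<wd pos="4,32,1078,324">The</wd>'
          '</p></text.cr></text>')
text_el = ET.fromstring(sample)

direct = text_el.findall('wd')      # direct children of <text> only
nested = text_el.findall('.//wd')   # descendants at any depth

print(len(direct))                  # no direct <wd> children
print([wd.text for wd in nested])   # text of the nested <wd> elements
```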
Upvotes: 1