Reputation: 859
I am using Python's ElementTree to parse an XML file.
Let's say I have an XML file like this:
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>
Is there any way I can extract the contents of the body with the p tags intact, like this:
desired= "<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>"
Upvotes: 0
Views: 592
Reputation: 27744
A slightly different approach from @DavidAlber's, where the children can easily be selected:
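from xml.etree import ElementTree
tree = ElementTree.parse("example.xml")
# The path is relative to the root <html> element, so no leading "/" is needed
body = tree.findall("body/p")
result = []
for elem in body:
    # encoding="unicode" makes tostring() return str rather than bytes on Python 3
    result.append(ElementTree.tostring(elem, encoding="unicode").strip())
print(" ".join(result))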
Upvotes: 0
Reputation: 18111
The following code does the trick.
import xml.etree.ElementTree as ET
root = ET.fromstring(doc) # doc is a string containing the example file
body = root.find('body')
# Iterate over body directly: getchildren() was removed in Python 3.9
desired = ' '.join([ET.tostring(c, encoding='unicode').strip() for c in body])
Now:
>>> desired
'<p>hello this is first paragraph </p> <p> hello this is second paragraph</p>'
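For reference, the same idea can be wrapped in a small reusable function. This is only a sketch for Python 3: the helper name inner_xml and the DOC constant are made up for illustration and are not part of ElementTree. It assumes the body contains only child elements with no significant text directly inside it.

```python
import xml.etree.ElementTree as ET

# The example document from the question, held in a string.
DOC = """<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>"""


def inner_xml(element):
    """Serialize the children of `element` with their tags (hypothetical helper)."""
    # encoding="unicode" makes tostring() return str instead of bytes.
    # tostring() also emits each child's tail (the whitespace after its
    # closing tag), so strip() removes that before joining.
    parts = [ET.tostring(child, encoding="unicode").strip() for child in element]
    return " ".join(parts)


body = ET.fromstring(DOC).find("body")
desired = inner_xml(body)
print(desired)
```

Any text sitting directly inside the body (outside the p elements) would be dropped by this version, so it only suits documents shaped like the example.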
Upvotes: 1
Reputation: 21466
You can use the lxml library.
The following code collects the text of each p element (note that text_content() returns only the text, without the tags):
import lxml.html
htmltree = lxml.html.fromstring('''
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>hello this is first paragraph </p>
<p> hello this is second paragraph</p>
</body>
</html>''')
p_tags = htmltree.xpath('//p')
p_content = [p.text_content() for p in p_tags]
print(p_content)
Upvotes: 0