Reputation: 35
I want to print the "Printable String" part of the markup below. I also tried to print each whole tag, but only found a way to print the tag name. Retrieving the XPath and the whole tag is currently my biggest challenge. Thank you!
Code:
from bs4 import BeautifulSoup
from lxml import etree
doc = "<p><a></a><a></a>Printable String</p>"
soup = BeautifulSoup(doc, "lxml")
root = etree.fromstring(str(soup))
tree = etree.ElementTree(root)
for i, e in enumerate(root.iter()):
    print(e.text)
Output:
None
None
None
None
None
Expected Output:
None
None
Printable String
None
None
Upvotes: 2
Views: 95
Reputation: 24928
A couple of things to notice:
First, for some reason you parse doc first with soup and then parse the string of soup again with lxml. The first problem is that BS doesn't leave the string alone. If you run print(soup), the output is:
<html><body><p><a></a><a></a>Printable String</p></body></html>
You will notice that two new elements (html and body) have been added, which explains why you get five Nones instead of only three.
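To see where the five entries come from, here is a minimal sketch that lists the tag of every element the loop visits. It reuses the doc string from the question; the tags variable is just a name picked for this illustration.

```python
from bs4 import BeautifulSoup
from lxml import etree

doc = "<p><a></a><a></a>Printable String</p>"

# BeautifulSoup's "lxml" parser wraps the fragment in <html><body>...</body></html>
soup = BeautifulSoup(doc, "lxml")
root = etree.fromstring(str(soup))

# Collect the tag name of every element in document order
tags = [e.tag for e in root.iter()]
print(tags)  # ['html', 'body', 'p', 'a', 'a']
```

Two of the five elements are the wrappers that BS added, so only three of them come from the original string.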
If you parse doc directly with lxml and use XPath, like so:
from lxml import etree
doc = "<p><a></a><a></a>Printable String</p>"
root = etree.fromstring(doc)
for z in root.xpath('//*'):
    print(z.xpath('text()'))
Output is
['Printable String']
[]
[]
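As a side note on why e.text printed None for p in the question: in lxml, .text only holds the text that comes before an element's first child, while text that follows a child's closing tag is stored on that child as .tail. A small sketch, assuming the same doc string (the rows variable is just an illustrative name):

```python
from lxml import etree

doc = "<p><a></a><a></a>Printable String</p>"
root = etree.fromstring(doc)

# For each element, record its tag, the text before its first child (.text),
# and the text that follows its closing tag (.tail)
rows = [(e.tag, e.text, e.tail) for e in root.iter()]
for row in rows:
    print(row)
# ('p', None, None)
# ('a', None, None)
# ('a', None, 'Printable String')
```

So the string is stored as the tail of the second a, which is why the text() XPath on p finds it but p.text does not.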
Upvotes: 1
Reputation:
It's as simple as:
from bs4 import BeautifulSoup
doc = "<p><a></a><a></a>Printable String</p>"
soup = BeautifulSoup(doc, "lxml")
print(soup.find('p').text)
Or, if you want a pure lxml.etree solution:
from lxml import etree
from io import StringIO
doc = '<p><a></a><a></a>Printable String</p>'
tree = etree.parse(StringIO(doc))
print(tree.xpath('//p/text()')[0])
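For the other half of the question (the XPath and the whole tag rather than just the tag name), one possible approach is ElementTree.getpath() together with etree.tostring(). This is only a sketch under the assumption that the doc string is parsed directly with lxml; the paths and markup variables are names chosen for this example.

```python
from lxml import etree

doc = "<p><a></a><a></a>Printable String</p>"
root = etree.fromstring(doc)
tree = etree.ElementTree(root)

# getpath() returns an absolute XPath expression for each element
paths = [tree.getpath(e) for e in root.iter()]

# tostring() serializes the whole element; with_tail=False leaves out
# any text that follows the element's own closing tag
markup = [
    etree.tostring(e, with_tail=False, encoding="unicode")
    for e in root.iter()
]

for path, tag_markup in zip(paths, markup):
    print(path, tag_markup)
```

Note that lxml serializes the markup from its parsed tree, so the printed tags may not match the input character for character (for example, an empty element may be written in self-closing form).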
Upvotes: 0