Reputation: 159
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" class="pc chrome win psc_dir-ltr psc_form-xlarge" dir="ltr" lang="en">
<title>Some Title</title>
</html>
If I run:
from lxml import etree
html = etree.parse('text.txt')
result = html.xpath('//title')
print(result)
I get an empty list. I suspect it has something to do with the namespace, but I can't figure out how to fix it.
Upvotes: 6
Views: 7285
Reputation: 425
You can use the namespaces parameter of the xpath method like this:
from lxml import etree
html = etree.parse('text.txt')
result = html.xpath('//n:title', namespaces={'n': 'http://www.w3.org/1999/xhtml'})
According to the lxml documentation, "[...] XPath does not have a notion of a default namespace. The empty prefix is therefore undefined for XPath and cannot be used in namespace prefix mappings". So if you are working with an element that has a default namespace, you have to map that namespace to a prefix of your own choosing when calling xpath, and use that prefix in the expression.
For more information see this similar question with a great answer.
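To see the effect of the prefix mapping in isolation, here is a minimal sketch that parses an inline string instead of text.txt. The inline markup, the XHTML_NS constant and the variable names are only illustrative assumptions; the local-name() variant is a standard XPath 1.0 alternative, not something the question asked for.

```python
from lxml import etree

# Illustrative stand-in for the contents of text.txt (assumption).
XHTML_NS = "http://www.w3.org/1999/xhtml"
doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Some Title</title></head></html>"
)

# Without a prefix, XPath looks for a title element in *no* namespace.
unprefixed = doc.xpath("//title")

# Mapping the default namespace to an arbitrary prefix makes the match work.
prefixed = doc.xpath("//n:title", namespaces={"n": XHTML_NS})

# Alternative: ignore the namespace and compare only the local name.
by_local_name = doc.xpath("//*[local-name()='title']")

print(unprefixed)              # []
print(prefixed[0].text)        # Some Title
print(by_local_name[0].tag)    # {http://www.w3.org/1999/xhtml}title
```

The prefix name ('n' here) is arbitrary; only the namespace URI has to match the one declared in the document.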
Upvotes: 1
Reputation: 1384
You can do it like this:
from lxml import etree
parser = etree.HTMLParser()
html = etree.parse('text.txt', parser)
result = html.xpath('//title/text()')
print(result)
The output is:
['Some Title']
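The reason this works is that the HTML parser does not turn the xmlns attribute into an XML namespace, so the elements end up with plain, unqualified tag names. A small sketch to check this, using an inline string as an assumed stand-in for text.txt:

```python
from lxml import etree

# Illustrative stand-in for the contents of text.txt (assumption).
markup = (
    '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en">'
    "<title>Some Title</title></html>"
)

# Parse with the HTML parser instead of the default XML parser.
root = etree.fromstring(markup, etree.HTMLParser())

# Tag names carry no namespace, so an unprefixed XPath matches.
titles = root.xpath("//title/text()")

print(root.tag)   # html
print(titles)     # ['Some Title']
```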
Upvotes: 1
Reputation: 1579
Try creating the tree using the HTML parser.
Also note that html.fromstring takes the markup itself, not a file name, so if text.txt is a file it needs to be read first:
with open('text.txt', 'r', encoding='utf8') as f:
    text_html = f.read()
Then build the tree like this:
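from lxml import etree, html

def build_lxml_tree(_html):
    # Parse the markup with the HTML parser, which yields un-namespaced tags
    tree = html.fromstring(_html)
    # Wrap the root element in an ElementTree
    tree = etree.ElementTree(tree)
    return tree

tree = build_lxml_tree(text_html)
result = tree.xpath('//title')
print(result)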
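If you would rather not read the file yourself, lxml.html.parse accepts a file name and returns an ElementTree directly. The sketch below writes a temporary file only so that it is self-contained; the temporary path and the sample markup are assumptions standing in for the real text.txt.

```python
import os
import tempfile

from lxml import html

# Illustrative stand-in for the contents of text.txt (assumption).
markup = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<title>Some Title</title></html>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(markup)

    # html.parse reads the file and uses the HTML parser.
    tree = html.parse(path)
    title_texts = [el.text for el in tree.xpath("//title")]

print(title_texts)   # ['Some Title']
```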
Upvotes: 2
Reputation: 13317
You can also use the HTML parser:
from lxml import etree
parser = etree.HTMLParser()
html = etree.parse('text.txt', parser)
result = html.xpath('//title')
print(result)
Upvotes: 1