Reputation: 1493
I'm trying to find all images (.png, .bmp, .jpg) and executables (.exe) from anchor links using lxml. From this similar thread, the accepted answer suggests doing something like this:
png = tree.xpath("//div/ul/li//a[ends-with(@href, '.png')]")
bmp = tree.xpath("//div/ul/li//a[ends-with(@href, '.bmp')]")
jpg = tree.xpath("//div/ul/li//a[ends-with(@href, '.jpg')]")
exe = tree.xpath("//div/ul/li//a[ends-with(@href, '.exe')]")
However, I get keep getting this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2095, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:53597)
File "xpath.pxi", line 373, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:134052)
File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:132625)
File "xpath.pxi", line 226, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:132453)
lxml.etree.XPathEvalError: Unregistered function
I'm running lxml 3.2.4 through pip.
Also, instead of defining the xpath 4 times for each file extension, is there a way to use xpath and specify all four file extensions at once?
Upvotes: 1
Views: 2331
Reputation: 9816
ends-with
is a function defined for XPath 2.0, XQuery 1.0 and XSLT 2.0 while lxml only supports XPath 1.0, XSLT 1.0 and the EXSLT extensions. So you couldn't use this function. The doc is here and here.
You could use regular expressions in XPATH. The following is a sample code which returns nodes matching the regular expressions:
regexpNS = 'http://exslt.org/regular-expressions'
tree.xpath("//a[re:test(@href, '(png|bmp|jpg|exe)$')]", namespaces={'re':regexpNS}")
Here is a similar question Python, XPath: Find all links to images and regular-expressions-in-xpath
Upvotes: 3
Reputation: 5942
I think that is a problem with an external library not recognizing the ends-with
function. The documentation discusses working with links. I think a nicer solution would be something like:
from urlparse import urlparse
tree.make_links_absolute(base_href='http://example.com/')
links = []
for i in tree.iterlinks():
url = urlparse(i[2]) # ensures you are getting the remote file path
if url.path.endswith('.png') or url.path.endswith('.exe') ... :
# there are other ways you could filter the links here
links.append(i[2])
Upvotes: 0