andres.gtz

Reputation: 634

How to get all "a" tags containing a certain "href" format using Python?

I am trying to get all links from a website using XPath. The URL format is quite specific, but parts of it are dynamic.

The URLs I'd like to get have the format "/static_word/random-string-with-dashes/random_number" (three segments: the first is static, the second is a random string, the third is a random number). Can you help me accomplish this?

I tried to do it with a regex, but it did not work.

Here is my code:

from lxml import html
import ssl
import requests
ssl._create_default_https_context = ssl._create_unverified_context
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
myRequest = requests.get("https://somesecureurl.com/", headers=headers)
webpage = html.fromstring(myRequest.content)
theLinks = webpage.xpath("//a[contains(@href,'^/static_word/[A-Za-z0-9_-]/[0-9]$')]")

print(theLinks)

Upvotes: 1

Views: 35

Answers (1)

Andersson

Reputation: 52675

XPath 2.0 has a matches() function that you can use to test a string against a regex:

//a[matches(@href,'^/static_word/[A-Za-z0-9_-]+/[0-9]+$')]

but AFAIK lxml doesn't support XPath 2.0 functions, so this expression will not work there.
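
As a side note, lxml does expose the EXSLT regular-expression extension functions to XPath 1.0 when the "http://exslt.org/regular-expressions" namespace is passed in, so a regex test is still possible. Below is a minimal sketch of that approach. The SAMPLE markup and the name EXSLT_RE are made up for illustration and stand in for the real page:

```python
from lxml import html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<html><body>
  <a href="/static_word/some-random-string/12345">match</a>
  <a href="/static_word/some-random-string/">no number</a>
  <a href="/other_word/some-random-string/12345">wrong prefix</a>
  <a href="/static_word/abc/12/extra">too many segments</a>
</body></html>
"""

# Namespace that enables the EXSLT regex functions (re:test, re:match, ...).
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

tree = html.fromstring(SAMPLE)

# re:test() returns true when the attribute value matches the pattern.
links = tree.xpath(
    r"//a[re:test(@href, '^/static_word/[A-Za-z0-9_-]+/[0-9]+$')]",
    namespaces=EXSLT_RE,
)

matched_hrefs = [a.get("href") for a in links]
print(matched_hrefs)
```

With this sample, only the first link should be selected.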

You can try this one instead:

//a[starts-with(@href, '/static_word/') and 
    (string-length(@href)-string-length(translate(@href, '/', '')))=3 and
    number(substring-after(substring-after(@href, '/static_word/'), '/'))>=0]

The predicate above checks three things:

  • starts-with(@href, '/static_word/') - the @href starts with the substring '/static_word/'
  • (string-length(@href)-string-length(translate(@href, '/', '')))=3 - the @href contains exactly 3 slashes
  • number(substring-after(substring-after(@href, '/static_word/'), '/'))>=0 - the last segment is a non-negative number (number() returns NaN for non-numeric text, and NaN>=0 is false)
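
To see how the three conditions behave, here is a small sketch that runs the XPath 1.0 expression through lxml. The SAMPLE markup is hypothetical. Note that number() also accepts decimals, so a last segment such as "1.5" passes the check; if that matters, the result needs an extra filter.

```python
from lxml import html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<html><body>
  <a href="/static_word/some-random-string/12345">match</a>
  <a href="/static_word/some-random-string/">empty last segment</a>
  <a href="/other_word/some-random-string/12345">wrong prefix</a>
  <a href="/static_word/abc/12/extra">four slashes</a>
  <a href="/static_word/abc/1.5">decimal last segment</a>
</body></html>
"""

# Pure XPath 1.0: prefix check, slash count, numeric last segment.
XPATH_1_0 = (
    "//a[starts-with(@href, '/static_word/') and "
    "(string-length(@href) - string-length(translate(@href, '/', ''))) = 3 and "
    "number(substring-after(substring-after(@href, '/static_word/'), '/')) >= 0]"
)

tree = html.fromstring(SAMPLE)
matched = [a.get("href") for a in tree.xpath(XPATH_1_0)]
print(matched)
```

With this sample, the first and the last links should be selected; the other three fail one of the conditions.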

The expression looks awful, but it should work :)
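
If the XPath gets too unwieldy, another option (not XPath-based, just plain Python) is to select every href and filter the strings with the re module. A rough sketch follows; HREF_PATTERN and filter_hrefs are names invented for this example, and in the question's code the candidates list would come from something like webpage.xpath("//a/@href"):

```python
import re

# Same pattern as in the matches() example; fullmatch() anchors both ends,
# so the ^ and $ are not needed.
HREF_PATTERN = re.compile(r"/static_word/[A-Za-z0-9_-]+/[0-9]+")


def filter_hrefs(hrefs):
    """Return only the hrefs that match the three-segment format."""
    return [h for h in hrefs if HREF_PATTERN.fullmatch(h)]


# Hypothetical href values standing in for the result of "//a/@href".
candidates = [
    "/static_word/some-random-string/12345",
    "/static_word/some-random-string/",
    "/other_word/some-random-string/12345",
    "/static_word/abc/1.5",
]

print(filter_hrefs(candidates))
```

Unlike the number() check, this rejects a decimal last segment, because the pattern only allows digits there.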

Upvotes: 2
