How to get URL out of href that is itself a hyperlink?

Question

I'm using Python and lxml to scrape this HTML page. The problem I'm running into is getting the URL out of the hyperlink whose href text is "Chapter02a". (Note that I can't seem to get the link formatting to work here.)

Examples of Operations

I have tried

//ol[@id="ProbList"]/li/a/@href

but that only gives me the text "Chapter02a".

Also:

//ol[@id="ProbList"]/li/a

This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.

from lxml import html
import requests

chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])

I want sections to be a list of URLs to the subsections.

James · Accepted Answer

The result you are seeing is correct: Chapter02a is a "relative" link to the next section, resolved against the URL of the page it appears on. The full URL is not returned because it is not stored in the HTML; only the relative part is.

To get the full urls you can use:

url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
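
As an alternative to string concatenation, the standard library's urllib.parse.urljoin resolves a relative href against the URL of the page it came from, and leaves hrefs that are already absolute unchanged. A minimal sketch, assuming the hrefs have already been extracted; the sections list here is a stand-in for the xpath result, and 'Chapter02b' is a made-up entry for illustration:

```python
from urllib.parse import urljoin

# URL of the page the hrefs were scraped from (taken from the question).
page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'

# Stand-in for chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href').
# 'Chapter02b' is a hypothetical second entry, not taken from the real page.
sections = ['Chapter02a', 'Chapter02b']

# urljoin replaces the last path segment of page_url with the relative href.
section_urls = [urljoin(page_url, s) for s in sections]
print(section_urls[0])
```

This avoids hard-coding url_base, since the base is derived from the page URL itself. lxml's HTML elements also have a make_links_absolute method that rewrites the links in a parsed document in place, which may be more convenient if many links are needed.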
