Alexander Nazareth

Reputation: 23

How to get URL out of href that is itself a hyperlink?

I'm using Python and lxml to scrape this HTML page. The problem I'm running into is getting the full URL out of a hyperlink whose href is just the text "Chapter02A":

<li><a href="Chapter02A">Examples of Operations</a></li>

I have tried

//ol[@id="ProbList"]/li/a/@href

but that only gives me the text "Chapter02A".

Also:

//ol[@id="ProbList"]/li/a

This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.

from lxml import html
import requests

chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])

I want sections to be a list of URLs to the subsections.

Upvotes: 2

Views: 109

Answers (2)

Allan

Reputation: 12438

You can also do the concatenation directly at the XPath level to rebuild the full URL from the relative link:

from lxml import html
import requests

chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",//ol[@id="ProbList"]/li/a/@href)')
print(sections)

output:

https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A
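
One caveat: in XPath 1.0, concat() converts a node-set to the string value of its first node, so the expression above yields a single URL rather than a list. If every link is needed, one option is lxml's make_links_absolute(), which rewrites all links in the parsed tree against a base URL. The sketch below runs on an inline stand-in for the page instead of fetching it; the PAGE markup and the second link (Chapter02B) are assumptions made up for illustration.

```python
from lxml import html

# Inline stand-in for the chapter page. The second <li> is hypothetical and
# exists only to show what happens when there is more than one link.
PAGE = """<html><body>
<ol id="ProbList">
  <li><a href="Chapter02A">Examples of Operations</a></li>
  <li><a href="Chapter02B">Hypothetical second section</a></li>
</ol>
</body></html>"""

# URL the page would have been fetched from (taken from the question).
BASE = "https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02"

doc = html.fromstring(PAGE)

# concat() only uses the first @href in the node-set, so this is one string.
first_only = doc.xpath(
    'concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",'
    ' //ol[@id="ProbList"]/li/a/@href)'
)

# Rewrite every link in the tree relative to BASE, then query as before.
doc.make_links_absolute(BASE)
sections = [str(href) for href in doc.xpath('//ol[@id="ProbList"]/li/a/@href')]

print(first_only)
print(sections)
```

With the real page, the same two calls would follow html.fromstring(chapter_req.content), passing the request URL as the base.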

Upvotes: 1

James

Reputation: 36608

The result you are seeing is correct: Chapter02A is a "relative" link to the next section. The full URL is not returned because that is not how the link is stored in the HTML.

To get the full URLs you can use:

url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
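
Plain string concatenation works here because every href is a bare file name. As a slightly more general sketch, urllib.parse.urljoin from the standard library resolves each href against the URL of the page itself, which also copes with "../" paths and links that are already absolute. The hrefs list below stands in for the XPath result; only "Chapter02A" comes from the page, the other two entries are hypothetical.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (from the question).
page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'

# Stand-in for chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href').
# Only the first entry is from the real page; the rest are made up.
hrefs = ['Chapter02A', '../Index', 'https://example.com/other']

# urljoin replaces the last path segment of page_url for relative links,
# walks up a directory for "../", and leaves absolute URLs unchanged.
section_urls = [urljoin(page_url, h) for h in hrefs]

for url in section_urls:
    print(url)
```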

Upvotes: 1
