Reputation: 23
I'm using Python and lxml to scrape this HTML page. The problem I'm running into is getting the full URL out of the hyperlink whose href is "Chapter02A":
<li><a href="Chapter02A">Examples of Operations</a></li>
I have tried
//ol[@id="ProbList"]/li/a/@href
but that only gives me the text "Chapter02A".
Also:
//ol[@id="ProbList"]/li/a
This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])
I want sections to be a list of URLs to the subsections.
Upvotes: 2
Views: 109
Reputation: 12438
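You can also do the concatenation directly at the XPath level to rebuild the URL from the relative link. Note that concat() converts the node-set to a string using only its first node, so this returns a single URL (the first link), not a list: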
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",//ol[@id="ProbList"]/li/a/@href)')
print(sections)
output:
https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A
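If you need every link rather than just the first, one option is to evaluate concat() relative to each <a> element. This is only a rough sketch: the inline page, the second link ("Chapter02B", "Second Section"), and the variable names are made up for illustration, and the base URL is passed in as an XPath variable.

```python
from lxml import html

# Hypothetical stand-in for the chapter page; the second link is invented.
page = html.fromstring(
    '<html><body><ol id="ProbList">'
    '<li><a href="Chapter02A">Examples of Operations</a></li>'
    '<li><a href="Chapter02B">Second Section</a></li>'
    '</ol></body></html>'
)

base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'

# concat() stringifies the node-set using its first node only,
# so this yields one URL even though two links match.
first_only = page.xpath(
    'concat($base, //ol[@id="ProbList"]/li/a/@href)', base=base
)

# Running concat() once per <a> element covers every link.
all_urls = [
    a.xpath('concat($base, @href)', base=base)
    for a in page.xpath('//ol[@id="ProbList"]/li/a')
]

print(first_only)
print(all_urls)
```

Here first_only is a single string while all_urls is a list with one entry per link.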
Upvotes: 1
Reputation: 36608
The result you are seeing is correct because Chapter02A is a relative link to the next section. The full URL is not listed because that is not how it is stored in the HTML.
To get the full urls you can use:
url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
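As a possible alternative to plain string concatenation, the standard library's urllib.parse.urljoin resolves a relative link against the URL of the page it came from, which should also cope with hrefs that are already absolute. A minimal sketch, assuming the hrefs list stands in for the result of the xpath() call (the second entry is invented for illustration):

```python
from urllib.parse import urljoin

# URL of the page that was scraped; relative links resolve against it.
page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'

# Stand-in for chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href').
# 'Chapter02B' is a hypothetical second link.
hrefs = ['Chapter02A', 'Chapter02B']

# urljoin replaces the last path segment of page_url with each relative href.
section_urls = [urljoin(page_url, href) for href in hrefs]

print(section_urls)
```

lxml.html elements also have a make_links_absolute(base_url) method that rewrites the links in the parsed document in place, if you would rather fix them before running the XPath query.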
Upvotes: 1