Python - trying to get URL (href) from web scraping using Scrapy

Question

I'm trying to get the URL (the href) from a web page using Scrapy. However, response.xpath('XPATH').extract() returns an empty list when I try to extract the href link. The HTML page structure is: The specific HTML element href I'm trying to get is: MAGOMEDOVA MADINA

The result of the xpath command is:

For context, I'm trying to get the information in each person's URL and extract it, but I'm unable to retrieve the href from the web page.

I copied the full xpath of the HTML element, and it's: /html/body/div[1]/div[1]/div[6]/div/div[2]/div/div[2]/div[2]/div/div[2]/div/div/div[2]/div[1]/a.

But this still returns [] when I run the response.xpath command.

stidmatt · Accepted Answer

In this situation I personally wouldn't use xpath, or even Scrapy. I believe the simplest solution is to use BeautifulSoup and requests together.

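import requests
from bs4 import BeautifulSoup

url = "YOUR_URL_HERE"  # placeholder: replace with the page you want to scrape
# Fetch the page and parse it with Python's built-in HTML parser
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# href=True skips <a> tags without an href, which would otherwise raise KeyError
links = soup.find_all("a", href=True)
urls = [link["href"] for link in links]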

This code will give you a list containing the href of every link on the page, and you can filter the list further by class or whatever else you need.
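
As a rough sketch of the filtering step mentioned above: the real page's markup isn't shown in the question, so the HTML, the class name athlete-link, the base URL https://example.com, and the helper extract_links below are all made up for illustration. The idea is to keep only the links with a given class and turn relative hrefs into absolute URLs, which may be handy if you then want to visit each person's page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page, which is not shown in the question.
SAMPLE_HTML = """
<div class="results">
  <a class="athlete-link" href="/athletes/1">MAGOMEDOVA MADINA</a>
  <a class="nav-link" href="/about">About</a>
  <a class="athlete-link" href="https://example.com/athletes/2">Another Person</a>
  <a class="athlete-link">No href here</a>
</div>
"""


def extract_links(html, base_url, css_class):
    """Return absolute URLs of <a> tags that carry css_class and have an href."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ filters by CSS class; href=True drops anchors without an href.
    anchors = soup.find_all("a", class_=css_class, href=True)
    # urljoin resolves relative hrefs against base_url and leaves absolute ones alone.
    return [urljoin(base_url, a["href"]) for a in anchors]


person_urls = extract_links(SAMPLE_HTML, "https://example.com", "athlete-link")
print(person_urls)
```

With a real page you would pass requests.get(url).text and the page's own URL instead of the sample values, and swap in whatever class the target links actually use.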
