Reputation: 525
I need help converting relative URLs to absolute URLs in a Scrapy spider.
I need to convert the links on my start pages to absolute URLs so that I can get the images of the scraped items, which are on the start pages. I have tried several approaches without success and I'm stuck. Any suggestions?
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1"
    ]

    def parse(self, response):
        image_urls = response.xpath('//div[@class="content"]/section[2]/div[2]/div/div/div/a/article/img/@src').extract()
        relative_url = response.xpath(u'''//div[contains(concat(" ", normalize-space(@class), " "), " content ")]/a/@href''').extract()
        for image_url, url in zip(image_urls, absolute_urls):
            item = ExampleItem()
            item['image_urls'] = image_urls
            request = Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request
Upvotes: 21
Views: 10073
Reputation: 811
There are three main ways to achieve that:
Using the urljoin function from urllib:
from urllib.parse import urljoin
# Same as: from w3lib.url import urljoin
url = urljoin(base_url, relative_url)
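As a rough illustration of how urljoin resolves different kinds of links, here is a small sketch. Only the base URL comes from the question; the hrefs are made up for the example.

```python
from urllib.parse import urljoin

# One of the start URLs from the question, used as the base.
base_url = "http://www.example.com/billboard?page=1"

# Hypothetical hrefs, as they might appear in the page's HTML.
root_relative = urljoin(base_url, "/items/42")          # starts at the site root
path_relative = urljoin(base_url, "items/42")           # replaces the last path segment
already_absolute = urljoin(base_url, "http://www.example.com/items/42")  # left unchanged
scheme_relative = urljoin(base_url, "//cdn.example.com/img/a.jpg")       # inherits "http:"

for url in (root_relative, path_relative, already_absolute, scheme_relative):
    print(url)
```

Note that the query string of the base URL (?page=1) is not carried over to the result, and a link that is already absolute passes through untouched, so it is safe to call urljoin on every extracted href.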
Using the response's urljoin wrapper method, as mentioned by Steve:
url = response.urljoin(relative_url)
If you also want to yield a request from that link, you can use the response's handy follow method:
# It will create a new request using the above "urljoin" method
yield response.follow(relative_url, callback=self.parse)
Upvotes: 37