Python Scrapy get absolute url using input processor

Question

I'm trying to create an input processor to convert scraped relative URLs to absolute URLs, based on this Stack Overflow post. I'm struggling with the loader_context concept and I'm probably mixing things up here. Could anyone point me in the right direction?

I have the following in items.py

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urlparse import urljoin

def convert_to_baseurl(url, loader_context):
    response = loader_context.get('response')
    return urljoin(url, response)


class Item(scrapy.Item):
    url = scrapy.Field(
        input_processor=MapCompose(convert_to_baseurl)
    )

And the following in my spider

class webscraper(scrapy.Spider):
    name = "spider"

    def start_requests(self):
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        for entry in response.css('li.aanbodEntry'):
            loader = ItemLoader(item=Huis(), selector=entry)
            loader.add_css('url', 'a')

            yield loader.load_item()

stranac · Accepted Answer

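The _urljoin() in the answer you referenced is a function written by that question's OP, and its signature differs from the stdlib urljoin().
The stdlib version takes the base URL first and the relative URL second, and both must be strings (your code passes the response object itself, and in the wrong position). The correct call would be: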

return urljoin(response.url, url)
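
To see why the argument order matters, here is a small sketch using only the stdlib. The URLs are made up for illustration, and the import is the Python 3 one (on Python 2, as in the question, it would be `from urlparse import urljoin`).

```python
# Python 3 import; the question's Python 2 equivalent is `from urlparse import urljoin`.
from urllib.parse import urljoin

# Hypothetical values, standing in for response.url and a scraped href.
base = "http://example.com/aanbod/index.html"
relative = "/huis/123"

# Correct order: base first, then the relative URL.
absolute = urljoin(base, relative)

# Swapped order: the second argument is already absolute, so it is
# returned unchanged and the relative path is silently ignored.
swapped = urljoin(relative, base)

print(absolute)
print(swapped)
```

With the arguments swapped, no error is raised, which makes this mistake easy to miss.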

There is no need for that, however, since you can use Response.urljoin():

def absolute_url(url, loader_context):
    return loader_context['response'].urljoin(url)

For the response to be accessible through the loader context, you need to pass it as an argument when creating the item loader, or use one of the other methods mentioned in the item loader documentation:

loader = ItemLoader(item=Huis(), selector=entry, response=response)
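
The processor itself can be tried out without running a spider. The sketch below is only an approximation: StandInResponse is a hypothetical class that imitates the two parts of a Scrapy response the processor relies on (a url attribute and a urljoin() method), and the plain dict stands in for the loader context. The URLs are invented.

```python
from urllib.parse import urljoin  # Python 3 import


class StandInResponse:
    """Hypothetical stand-in for a Scrapy response, for illustration only."""

    def __init__(self, url):
        self.url = url

    def urljoin(self, url):
        # Resolves `url` against this response's own URL.
        return urljoin(self.url, url)


def absolute_url(url, loader_context):
    # Same processor as in the answer above.
    return loader_context['response'].urljoin(url)


# A plain dict plays the role of the loader context here (an assumption
# made for this sketch; in Scrapy the ItemLoader builds the context).
context = {'response': StandInResponse("http://example.com/aanbod/")}
result = absolute_url("huis/123", context)

# Without response=... the key is absent, so .get('response') yields None,
# which is what the processor in the question ended up working with.
missing = {}.get('response')

print(result)
```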
