Reputation: 1414
I'm trying to create an input processor to convert scraped relative URLs to absolute URLs, based on this Stack Overflow post. I'm struggling with the loader_context concept and am probably mixing things up here. Could anyone point me in the right direction?
I have the following in items.py
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urlparse import urljoin

def convert_to_baseurl(url, loader_context):
    response = loader_context.get('response')
    return urljoin(url, response)

class Item(scrapy.Item):
    url = scrapy.Field(
        input_processor=MapCompose(convert_to_baseurl)
    )
And the following in my spider
class webscraper(scrapy.Spider):
    name = "spider"

    def start_requests(self):
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for entry in response.css('li.aanbodEntry'):
            loader = ItemLoader(item=Huis(), selector=entry)
            loader.add_css('url', 'a')
            yield loader.load_item()
Upvotes: 1
Views: 880
Reputation: 28256
The _urljoin() in the answer you referenced is a function written by the OP, and it has a different signature than the one in the stdlib.
The correct way to use the stdlib urljoin() would be:

    return urljoin(response.url, url)
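For comparison, here is a minimal sketch of your processor rewritten around the stdlib function with the arguments in the right order (on Python 3 the import comes from urllib.parse rather than urlparse):

    from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

    def convert_to_baseurl(url, loader_context):
        # response.url is the base; the scraped (possibly relative) url is joined onto it
        response = loader_context.get('response')
        return urljoin(response.url, url)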
There is no need to use that, however, since you can use Response.urljoin():

    def absolute_url(url, loader_context):
        return loader_context['response'].urljoin(url)
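Wired into your items.py, that would look something like this (a sketch based on the Item class from your question, just swapping in absolute_url):

    import scrapy
    from scrapy.loader.processors import MapCompose

    def absolute_url(url, loader_context):
        # Response.urljoin() resolves the scraped (possibly relative) url against response.url
        return loader_context['response'].urljoin(url)

    class Item(scrapy.Item):
        url = scrapy.Field(
            input_processor=MapCompose(absolute_url)
        )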
For the response to be accessible through the context attribute, you need to pass it as an argument when creating the item loader, or use one of the other methods mentioned in the item loader docs:

    loader = ItemLoader(item=Huis(), selector=entry, response=response)
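In your parse() callback that would look roughly like this (a sketch reusing the Huis item and CSS selector from your spider; the ::attr(href) part is an assumption that you want the link target rather than the whole <a> element):

    def parse(self, response):
        for entry in response.css('li.aanbodEntry'):
            # passing response= here makes it available as loader_context['response']
            loader = ItemLoader(item=Huis(), selector=entry, response=response)
            loader.add_css('url', 'a::attr(href)')
            yield loader.load_item()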
Upvotes: 4