Reputation: 51
I'm trying to scrape some data from allabolag.se. I want to follow the links on pages such as http://www.allabolag.se/5565794400/befattningar, but Scrapy does not extract the links correctly: the "25" right after the "%" is missing, so "%252C" becomes "%2C".
For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
But Scrapy requests the following link instead: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
I read on this site that it has something to do with double encoding: https://www.owasp.org/index.php/Double_Encoding
How do I get around this?
My code is as follows:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from allabolag.items import AllabolagItem
from scrapy.loader.processors import Join

class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]
    rules = (
        Rule(LinkExtractor(allow = "http://www.allabolag.se", restrict_xpaths=('//*[@id="printContent"]//a[1]')), callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item
Upvotes: 1
Views: 299
Reputation: 20748
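You can configure the link extractor not to canonicalize the URLs by passing canonicalize=False. With canonicalization on, the extracted URL contains %2C instead of %252C, and the site answers that request with a 404.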
Scrapy shell session:
$ scrapy shell http://www.allabolag.se/5565794400/befattningar
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor()
>>> for l in le.extract_links(response):
...     print l
...
(...stripped...)
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False)
(...stripped...)
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None)
>>>
>>> le = LinkExtractor(canonicalize=False)
>>> for l in le.extract_links(response):
...     print l
...
(...stripped...)
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False)
(...stripped...)
>>>
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None)
So you should be good with:
class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]
    rules = (
        Rule(LinkExtractor(allow = "http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
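To see why losing the "25" changes the URL, here is a minimal sketch using only Python 3's standard library (not Scrapy itself). It shows that %252C is a comma that has been percent-encoded twice, so a single round of decoding yields %2C, which is a different path as far as the server is concerned. The variable names are just for illustration; the path segment is the one from the question.

```python
from urllib.parse import quote, unquote

# Path segment as it appears in the working (200) URL from the question.
double_encoded = "de_Sauvage-Nolting%252C_Henri_Jacob_Jan"

# One round of percent-decoding turns %25 into a literal "%", leaving %2C.
once = unquote(double_encoded)
# A second round decodes %2C into the comma itself.
twice = unquote(once)

print(once)
print(twice)

# Going the other way: encode the comma once, then encode the result again.
single = quote("de_Sauvage-Nolting,_Henri_Jacob_Jan", safe="")
double = quote(single, safe="")
print(double == double_encoded)
```

Any step that decodes or normalizes the URL once before the request is sent will therefore produce the single-encoded form, which is what the 404 in the shell session corresponds to.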
Upvotes: 2