Helvdan

Reputation: 436

Crawl non-latin domain with scrapy

I need to crawl some websites in the ".рф" domain zone with Scrapy. The URLs have a structure like this: "http://сайтдляпримера.рф" (this URL is not real; it is just an example). The website I am trying to work with is, of course, accessible from a browser. I tried using the start_urls property to begin crawling, e.g.:

start_urls = ['http://сайтдляпримера.рф']

And also the start_requests method:

def start_requests(self):
    return [scrapy.Request("http://сайтдляпримера.рф/", callback=self._test)]

Neither of them worked as expected; I got the following console output:

2016-01-01 19:02:01 [scrapy] INFO: Spider opened
2016-01-01 19:02:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-01 19:02:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 1 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 2 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Gave up retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 3 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] ERROR: Error downloading <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84>: DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] INFO: Closing spider (finished)

If it matters, I need to run Scrapy on a Linux-based OS.

Are there any solutions? If possible, is there a way to solve this from the spider file itself, since I do not have access to the framework's repository (nothing that handles HTTP requests has been modified there)?

Upvotes: 3

Views: 552

Answers (1)

vrs

Reputation: 1982

When dealing with Internationalized Domain Names (IDN), you need to encode the non-ASCII host name with the idna codec, then decode the resulting bytes back into a unicode string. Also note that the ASCII part of the URL that makes up the scheme ('http://') should be prepended separately, so that it does not get mangled by the IDNA encoding:

'http://' + u'сайтдляпримера.рф'.encode('idna').decode('utf-8')

See also this document for more details.
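Below is a minimal sketch of how this could look inside a spider, assuming the same example domain and a hypothetical spider name; the host is IDNA-encoded first and the scheme is prepended afterwards:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'idn_example'  # hypothetical spider name

    def start_requests(self):
        # Encode only the non-ASCII host with the idna codec, then
        # decode the resulting ASCII bytes back into a string and
        # prepend the scheme separately.
        host = u'сайтдляпримера.рф'
        url = 'http://' + host.encode('idna').decode('utf-8')
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info('Crawled %s', response.url)

The URL Scrapy ends up requesting has a plain-ASCII 'xn--…' (Punycode) host, so for a real domain the DNS lookup should succeed.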

Upvotes: 2
