Slater Victoroff

Reputation: 21914

Scrapy adds %0A to URLs, causing them to fail

I'm almost at my wit's end with this one. Basically I have a URL that seems to be somehow magical. Specifically it's this:

https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031

When I hit it with requests, everything works fine:

import requests
test = requests.get("https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031")
<Response [200]>

However, when I use scrapy, the following line pops out:

Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>

I even tried updating my User-Agent string, to no avail. Part of me suspects that the URL-encoded %0A is responsible, but that seems rather strange, and I can't find any documentation on how to fix it.

For reference, this is how I'm sending the request, though I'm not sure this will add much information:

for url in review_urls:
    yield scrapy.Request(url, callback=self.get_review_urls)

It's important to note that this is the exception rather than the rule. Most URLs work unhindered, but these edge cases are not uncommon.

Upvotes: 2

Views: 2599

Answers (1)

Jithin

Reputation: 1712

I don't think this is a problem with Scrapy itself; I suspect there is a problem with your review_urls.

Please see this demo from the Scrapy shell. Somehow your URL ends with a line feed (docs here), and during URL encoding that \n is converted to %0A. It seems you either accidentally added a newline character at the end of the URL, or the extracted URL contains an extra trailing line feed.
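As a side check (not part of the original answer), the encoding step can be reproduced with just the standard library's urllib.parse — a trailing newline percent-encodes to exactly the %0A seen in the failing request:

```python
from urllib.parse import quote, unquote

# Percent-encoding a newline produces the "%0A" from the failing URL.
encoded = quote("\n")
print(encoded)  # %0A

# Decoding "%0A" gives the newline back.
print(repr(unquote("%0A")))  # '\n'
```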

scrapy shell 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031'
2015-08-02 05:48:56 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-08-02 05:48:56 [scrapy] INFO: Optional features available: ssl, http11
2015-08-02 05:48:56 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-08-02 05:48:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-08-02 05:48:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-02 05:48:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-02 05:48:56 [scrapy] INFO: Enabled item pipelines: 
2015-08-02 05:48:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-02 05:48:56 [scrapy] INFO: Spider opened
2015-08-02 05:48:58 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:48:59 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   response   <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-08-02 05:48:59 [root] DEBUG: Using default logger
2015-08-02 05:48:59 [root] DEBUG: Using default logger

In [1]: url = 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n'

In [2]: fetch(url)
2015-08-02 05:49:24 [scrapy] DEBUG: Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s]   response   <404 https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Perform a strip() on the URLs before you make the request, and it will give you the desired result, as follows:

In [3]: fetch(url.strip())
2015-08-02 05:53:01 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:53:03 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   response   <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
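Applied to the loop from the question, the fix is a one-line change. The sketch below leaves out scrapy.Request so it runs standalone; in the real spider you would wrap each cleaned URL, e.g. yield scrapy.Request(url.strip(), callback=self.get_review_urls):

```python
def clean_urls(review_urls):
    """Strip stray whitespace (including trailing newlines) from extracted URLs."""
    for url in review_urls:
        yield url.strip()

# Example with a URL carrying a trailing newline, as in the question:
urls = list(clean_urls([
    "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n",
]))
print(urls[0])  # clean URL, no trailing newline to become %0A
```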

Upvotes: 2
