Maciek Gierwatowski

Reputation: 26

Scrapy can't follow URLs with commas without encoding them

Can I force Scrapy to request a URL that contains commas without encoding them as %2C? The site (Phorum) I want to crawl does not accept encoded URLs and redirects me to the root page.

For example, I want to parse this page: example.phorum.com/read.php?12,8

The URL is encoded into: example.phorum.com/read.php?12%2C8=

But every time I request this URL, I'm redirected to the page with the list of topics:

example.phorum.com/list.php?12

In these example URLs, 12 is the category number and 8 is the topic number.

I tried to disable redirecting by disabling RedirectMiddleware:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
    }

and in the spider:

    handle_httpstatus_list = [302, 403]

I also tried to rewrite the URL and request it again from an intermediate callback:

    rules = [Rule(RegexLinkExtractor(allow=[r'(.*%2C.*)']), follow=True, callback='prepare_url')]

    def prepare_url(self, response):
        # Undo the percent-encoding of commas and drop the trailing "=".
        url = re.sub(r'%2C', ',', response.url)
        if url.endswith('='):
            url = url[:-1]
        yield Request(urllib.unquote(url), callback=self.parse_site)

Here parse_site is the target parser, but it is still called with the encoded URL.

Thanks in advance for any feedback

Upvotes: 0

Views: 299

Answers (1)

paul trmbrth

Reputation: 20748

You can try passing canonicalize=False to the link extractor. Example IPython session:

    In [1]: import scrapy
    In [2]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
    In [3]: hr = scrapy.http.HtmlResponse(url="http://example.phorum.com", body="""<a href="http://example.phorum.com/list.php?1,2">link</a>""")
    In [4]: lx = RegexLinkExtractor(canonicalize=False)
    In [5]: lx.extract_links(hr)
    Out[5]: [Link(url='http://example.phorum.com/list.php?1,2', text=u'link', fragment='', nofollow=False)]
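
To see where the %2C and the trailing "=" in the question probably come from, here is a rough standard-library sketch of what URL canonicalization does to a query string: it parses the query into key/value pairs, sorts them, and URL-encodes them again. The helper name canonicalize_query is made up for illustration and is not Scrapy's API; the real implementation does more than this (and the session above is Python 2, while this sketch uses Python 3's urllib.parse).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_query(url):
    """Hypothetical helper: approximate the query-string part of URL
    canonicalization (parse into pairs, sort, re-encode)."""
    parts = urlsplit(url)
    # "12,8" has no "=", so it is parsed as the key "12,8" with an empty value.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # urlencode percent-encodes the comma and writes the empty value as "key=".
    return urlunsplit(parts._replace(query=urlencode(pairs)))


original = "http://example.phorum.com/read.php?12,8"
print(canonicalize_query(original))
# http://example.phorum.com/read.php?12%2C8=
```

With canonicalize=False the link extractor skips this step, so the extracted link keeps the literal comma, as in Out[5] above.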

Upvotes: 2
