GhostKU

Reputation: 2108

How to modify a URL before following it in Scrapy?

I'm new to Scrapy and this is my second spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class SitenameScrapy(scrapy.Spider):
    name = "sitename"
    allowed_domains = ['www.sitename.com', 'sitename.com']
    rules = [Rule(LinkExtractor(unique=True), follow=True)]

    def start_requests(self):
        urls = ['http://www.sitename.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_cat)

    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            if ('/category/' in link.url):
                yield response.follow(link, self.parse_cat)
            if ('/product/' in link.url):
                yield response.follow(link, self.parse_prod)

    def parse_prod(self, response):
        pass

My problem is that sometimes I get links like http://sitename.com/path1/path2/?param1=value1&param2=value2. param1 is not important to me, and I want to remove it from the URL before calling response.follow. I think I could do it with a regex, but I'm not sure that is the 'right way' in Scrapy. Should I use some kind of rule for this instead?

Upvotes: 1

Views: 832

Answers (1)

Wilfredo

Reputation: 1548

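I think you could use the url_query_cleaner function from the w3lib library, which Scrapy already depends on. By default it keeps only the parameters you list, so the call below keeps param2 and drops everything else, including param1. To strip only param1 and leave any other parameters alone, pass ('param1',) with remove=True instead. Something like: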

from w3lib.url import url_query_cleaner
# ... rest of the spider as in the question ...
    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            url = url_query_cleaner(link.url, ('param2',))
            if '/category/' in url:
                yield response.follow(url, self.parse_cat)
            if '/product/' in url:
                yield response.follow(url, self.parse_prod)
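
If you would rather not rely on w3lib, the same idea can be sketched with only the standard library. This is a minimal illustration rather than anything Scrapy itself provides: the helper name drop_query_params is hypothetical, and it removes the named parameters instead of keeping a whitelist.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, names):
    """Return url with every query parameter whose name is in names removed."""
    parts = urlsplit(url)
    # parse_qsl preserves parameter order; keep_blank_values keeps "?a=&b=1" intact.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # Rebuild the URL with the filtered query string.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = drop_query_params(
    "http://sitename.com/path1/path2/?param1=value1&param2=value2",
    {"param1"},
)
print(cleaned)  # http://sitename.com/path1/path2/?param2=value2
```

Inside parse_cat you would call the helper on link.url and pass the result to response.follow, exactly as with url_query_cleaner above. Parsing the query string like this is generally less fragile than a regex, since it does not depend on where the parameter sits in the URL.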

Upvotes: 4
