Reputation: 2108
I'm new to Scrapy and this is my second spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class SitenameScrapy(scrapy.Spider):
    name = "sitename"
    allowed_domains = ['www.sitename.com', 'sitename.com']
    rules = [Rule(LinkExtractor(unique=True), follow=True)]

    def start_requests(self):
        urls = ['http://www.sitename.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_cat)

    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            if '/category/' in link.url:
                yield response.follow(link, self.parse_cat)
            if '/product/' in link.url:
                yield response.follow(link, self.parse_prod)

    def parse_prod(self, response):
        pass
My problem is that sometimes I have links like http://sitename.com/path1/path2/?param1=value1&param2=value2, and for me param1 is not important, so I want to remove it from the URL before response.follow. I think I could do it with a regex, but I'm not sure that's the 'right way' for Scrapy. Maybe I should use some kind of rule for this?
Upvotes: 1
Views: 832
Reputation: 1548
I think you could use the url_query_cleaner function from the w3lib library. Something like:
from w3lib.url import url_query_cleaner

...

    def parse_cat(self, response):
        links = LinkExtractor().extract_links(response)
        for link in links:
            url = url_query_cleaner(link.url, ('param2',))
            if '/category/' in url:
                yield response.follow(url, self.parse_cat)
            if '/product/' in url:
                yield response.follow(url, self.parse_prod)
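If it is easier to list the parameters you want to drop rather than the ones you want to keep, url_query_cleaner also accepts a remove flag. A minimal sketch, using the parameter names from the example URL in the question:

url = 'http://sitename.com/path1/path2/?param1=value1&param2=value2'

# Keep only the listed parameters (what the snippet above does):
url_query_cleaner(url, ('param2',))
# -> 'http://sitename.com/path1/path2/?param2=value2'

# Or remove the listed parameters and keep everything else:
url_query_cleaner(url, ('param1',), remove=True)
# -> 'http://sitename.com/path1/path2/?param2=value2'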
Upvotes: 4