Reputation: 757
I'm using Scrapy to crawl a site. My problem is that when I extract a URL from an href, I get %20 at the end of the URL. To remove it, I used split and got my desired URL.
For example:
Original URL: http://www.example.com/category/%20
My modified URL looks like: http://www.example.com/category/
I pass my modified URL to Request, but the request still uses the original URL, not the modified one.
My parse and extract methods are below:
def parse(self, response):
    sel = Selector(response)
    requests = []
    # Get Product Reviews
    for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
        url = url.encode('utf-8').split('%')[0]
        requests.append(Request(url, callback=self.extract))
    for request in requests:
        print request.url
        yield request

def extract(self, response):
    sel = Selector(response)
    requestedItem = ProductItem()
    requestedItem['name'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[1]/div/div/h1/text()').extract()[0].encode('utf-8')
    requestedItem['description'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[2]/div/div/div[1]/p/text()').extract()[0].encode('utf-8')
    yield requestedItem
Could anyone please help me resolve this issue?
Upvotes: 2
Views: 1093
Reputation: 3691
Please take a look at the following answer (and the related question): Scrapy: URL error, Program adds unnecessary characters(URL-codes)
As you can see there, whitespace is being added to the URL. To fix this, you can either apply normalize-space when you select the URL or simply strip it before you yield the request.
The %20 is a percent-encoded single space. The space is only escaped when the URL is requested, so the extracted string does not contain %20 at the end, and splitting on % has no effect.
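To make the cause concrete, here is a minimal sketch using only the standard library (Python 3). The href value is a hypothetical stand-in for what the XPath returns, and `quote` is used here only to approximate the escaping that happens when the request URL is made safe; it is not a claim about Scrapy's internal call.

```python
from urllib.parse import quote

# Hypothetical href as extracted from the page: note the trailing space.
href = "http://www.example.com/category/ "

# The extracted string contains no '%', so split('%') changes nothing.
after_split = href.split('%')[0]

# Percent-encoding turns the trailing space into %20.
encoded = quote(after_split, safe=":/")

# Stripping first removes the space before any encoding takes place.
cleaned = quote(href.strip(), safe=":/")

print(repr(after_split))
print(encoded)
print(cleaned)
```

The trailing space survives the split, which is why the printed request URL still ends in %20.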
So instead of using
url = url.encode('utf-8').split('%')[0]
you can either use
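# normalize-space() on a node-set only uses the first node, so apply it per link
for link in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a'):
    url = link.xpath('normalize-space(@href)').extract()[0]
    requests.append(Request(url, callback=self.extract))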
or
for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
    requests.append(Request(url.strip(), callback=self.extract))
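If several callbacks need the same cleanup, the strip step could be pulled into a small helper. This is only a sketch: `clean_hrefs` and the sample list are hypothetical names and data, not part of Scrapy or of the original spider.

```python
def clean_hrefs(hrefs):
    """Yield each href with surrounding whitespace removed, skipping blanks."""
    for href in hrefs:
        cleaned = href.strip()
        if cleaned:
            yield cleaned


# Hypothetical extracted values, including a blank entry and stray whitespace.
sample = [
    "http://www.example.com/category/ ",
    "  ",
    "\nhttp://www.example.com/item/1\t",
]
cleaned_sample = list(clean_hrefs(sample))
print(cleaned_sample)
```

In the spider, the loop would then iterate over `clean_hrefs(sel.xpath(...).extract())` and pass each value straight to Request.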
Upvotes: 4