Manoj Kumar

Reputation: 757

Scrapy Request URL going wrong

I'm using Scrapy to crawl a site.

My problem is that when I extract the URL from an href, I get %20 at the end of it. To remove that, I split the string and got my desired URL.

For example:

Original URL: http://www.example.com/category/%20

My modified URL looks like: http://www.example.com/category/

So I pass the modified URL to the Request method, but the request still fetches the original URL, not the modified one.

My parse and extract methods are below:

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    requests = []

    # Get Product Reviews
    for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
        url = url.encode('utf-8').split('%')[0]
        requests.append(Request(url, callback=self.extract))

    for request in requests:
        print request.url
        yield request
        
def extract(self, response):
    sel = Selector(response)
    # ProductItem is assumed to be the project's Item subclass from items.py
    requestedItem = ProductItem()
    requestedItem['name'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[1]/div/div/h1/text()').extract()[0].encode('utf-8')
    requestedItem['description'] = sel.xpath('//*[@id="content-wrapper"]/div/div[1]/div[2]/div/div/div[1]/p/text()').extract()[0].encode('utf-8')
    
    yield requestedItem

Please, can anyone help me resolve this issue?

Upvotes: 2

Views: 1093

Answers (1)

GHajba

Reputation: 3691

Please take a look at the following answer (and the related question): Scrapy: URL error, Program adds unnecessary characters(URL-codes)

As you can see there, whitespace is added to the URL. To fix this you can either use normalize-space() when you select the URL, or simply strip the URL before you yield the request.

That's because %20 is an escaped single space: the extracted href ends in a literal space character, which is only escaped to %20 when the URL is actually requested, so you do not see the %20 at the end of your extracted URL. That is also why your split('%') has no effect -- the raw string contains a space, not the literal characters %20.
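
To see this concretely, here is a minimal sketch (plain Python 2, matching the style of your code; the href value is a made-up example based on your question). Scrapy escapes the URL when it builds the Request (internally it uses w3lib's safe_url_string), and that is the point where the trailing space becomes %20:

    from w3lib.url import safe_url_string  # the escaping Scrapy applies to request URLs

    href = 'http://www.example.com/category/ '  # extracted href ends in a literal space

    # split('%') has nothing to split on: the raw string holds a space, not "%20"
    print href.split('%')[0]             # http://www.example.com/category/  (space intact)

    # the space only becomes %20 once the URL is escaped for the request
    print safe_url_string(href)          # http://www.example.com/category/%20

    # stripping first removes the space, so there is nothing left to escape
    print safe_url_string(href.strip())  # http://www.example.com/category/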

So instead of using

url = url.encode('utf-8').split('%')[0]

You can either

# normalize-space() on a node-set only returns the first node's value, so normalize per link
for link in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a'):
    url = link.xpath('normalize-space(@href)').extract()[0]
    requests.append(Request(url, callback=self.extract))

or

for url in sel.xpath('//div[contains(@id,"post")]/div/div[2]/h3/a/@href').extract():
    requests.append(Request(url.strip(), callback=self.extract))
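
Of the two, strip() is usually the simpler choice: it removes any leading and trailing whitespace (spaces, tabs, newlines) from each extracted href, whereas normalize-space() has to be applied per link element, because calling it on a whole node-set only returns the first node's string value.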

Upvotes: 4
