Reputation: 1
The spider will crawl the website, but I keep getting "404 HTTP status code is not handled or not allowed". Is my code fully correct?
I've changed the user agent in settings.py, but it doesn't fix the problem.
import scrapy

# Creating a new class to implement Spider
class QuuickSpider(scrapy.Spider):
    # Spider name
    name = 'quick'
    # Domain names to scrape
    allowed_domains = ['trustpilot.com']
    # Base URL for the Quicken Loans reviews
    myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
    start_urls = []
    # Creating a list of URLs to be scraped by appending the page number at the end of the base URL
    for i in range(1, 121):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting product star ratings
        star_rating = data.css('.review-rating')
        # Collecting user reviews
        comments = data.css('.review-text')
        count = 0
        # Combining the results
        for review in star_rating:
            yield {'stars': ''.join(review.xpath('.//text()').extract()),
                   'comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
Upvotes: 0
Views: 265
Reputation: 1766
Check your start URL: https://www.trustpilot.com/review/www.quickenloans.com
Try to open it in a browser; what you see there is what your code gets in the response. It's an invalid URL. Make sure that you have a proper URL to scrape.
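A quick standalone check (not part of the spider, just the URL-building loop from the question reproduced on its own) makes the problem visible: the page number is glued directly onto the domain.
# Reproduce the URL-building loop from the question and print a few results
myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
for i in range(1, 4):
    print(myBaseUrl + str(i))

# Output:
# https://www.trustpilot.com/review/www.quickenloans.com1
# https://www.trustpilot.com/review/www.quickenloans.com2
# https://www.trustpilot.com/review/www.quickenloans.com3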
Upvotes: 0
Reputation: 1035
A 404 error means that the page was not found. As the previous answer said, you're producing invalid URLs with this code. I'm not sure exactly what kind of URLs you are trying to create, but let me show you how I would handle something similar. The standard way of generating your start URLs is to override the built-in start_requests method. Scrapy start_requests docs[1]
import scrapy

# Creating a new class to implement Spider
class QuuickSpider(scrapy.Spider):
    # Spider name
    name = 'quick'
    # Domain names to scrape -- EDIT: include .com, as it is part of the domain;
    # www is a subdomain and not needed, but .com is part of the base domain name.
    allowed_domains = ['trustpilot.com']
    start_urls = ['https://www.trustpilot.com/review/www.quickenloans.com/{}']

    # I don't think appending to start_urls is a great way to go about this; I would
    # take this approach instead. start_requests is a built-in Scrapy method which
    # you can override in order to generate your start URLs.
    def start_requests(self):
        for i in range(1, 121):
            link = self.start_urls[0].format(str(i))
            yield scrapy.Request(link, callback=self.parse)

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting product star ratings
        star_rating = data.css('.review-rating')
        # Collecting user reviews
        comments = data.css('.review-text')
        count = 0
        # Combining the results
        for review in star_rating:
            yield {'stars': ''.join(review.xpath('.//text()').extract()),
                   'comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
[1]: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
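If you want to test the spider quickly without running it through the Scrapy CLI, one option is a minimal sketch using Scrapy's CrawlerProcess (the file name and user agent below are just example values):
# Minimal sketch: run the spider from a plain Python script instead of `scrapy crawl`.
# Assumes QuuickSpider is defined in the same file or imported from your project.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"reviews.json": {"format": "json"}},  # write yielded items to reviews.json
    "USER_AGENT": "Mozilla/5.0",  # example user agent; adjust to your needs
})
process.crawl(QuuickSpider)
process.start()  # blocks until the crawl finishes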
Upvotes: 1
Reputation: 3561
This code produces invalid links:
# Creating list of urls to be scraped by appending page number at the end of base url
for i in range(1, 121):
    start_urls.append(myBaseUrl + str(i))
The result of myBaseUrl + str(i) will be URLs like these (without the / symbol):
https://www.trustpilot.com/review/www.quickenloans.com1
https://www.trustpilot.com/review/www.quickenloans.com2
https://www.trustpilot.com/review/www.quickenloans.com3
If you expect to see links like this:
https://www.trustpilot.com/review/www.quickenloans.com/1
https://www.trustpilot.com/review/www.quickenloans.com/2
https://www.trustpilot.com/review/www.quickenloans.com/3
then for valid links you need to replace myBaseUrl + str(i) with myBaseUrl + "/" + str(i).
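Equivalently, the whole loop can be written as a single list comprehension with the separator included (a small sketch of the corrected code):
# Corrected URL list: note the "/" between the base URL and the page number
myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
start_urls = [myBaseUrl + "/" + str(i) for i in range(1, 121)]
print(start_urls[0])  # https://www.trustpilot.com/review/www.quickenloans.com/1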
Upvotes: 1