Reputation: 1
The spider will crawl the website, but I keep getting "404 HTTP status code is not handled or not allowed". Is my code fully correct?
I've changed the user agent in settings.py, but it doesn't fix the problem.
import scrapy

# Creating a new class to implement Spider
class QuuickSpider(scrapy.Spider):
    # Spider name
    name = 'quick'
    # Domain names to scrape
    allowed_domains = ['trustpilot.com']
    # Base URL for the Quicken Loans reviews
    myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
    start_urls = []
    # Creating a list of URLs to be scraped by appending the page number at the end of the base URL
    for i in range(1, 121):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting product star ratings
        star_rating = data.css('.review-rating')
        # Collecting user reviews
        comments = data.css('.review-text')
        count = 0
        # Combining the results
        for review in star_rating:
            yield {'stars': ''.join(review.xpath('.//text()').extract()),
                   'comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
Upvotes: 0
Views: 265
Reputation: 1766
Check your start URL: https://www.trustpilot.com/review/www.quickenloans.com
Try to open it in a browser; what you see there is what your code gets in the response. It's an invalid URL. Make sure that you have a proper URL to scrape.
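A quick standalone check (not part of the spider, just the URL-building loop from the question reproduced on its own) makes the problem visible: the page number is glued directly onto the domain.
# Reproduce the URL-building loop from the question and print a few results
myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
for i in range(1, 4):
    print(myBaseUrl + str(i))

# Output:
# https://www.trustpilot.com/review/www.quickenloans.com1
# https://www.trustpilot.com/review/www.quickenloans.com2
# https://www.trustpilot.com/review/www.quickenloans.com3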
Upvotes: 0
Reputation: 1035
A 404 error means that the page was not found. As the previous answer said, you're producing invalid URLs with this code. I'm not sure exactly what kind of URLs you are trying to create, but let me show you how I would handle something similar. The standard way of generating your start URLs is to override the built-in start_requests method. Scrapy start_requests docs[1]
import scrapy

# Creating a new class to implement Spider
class QuuickSpider(scrapy.Spider):
    # Spider name
    name = 'quick'
    # Domain names to scrape -- EDIT: include .com, as it is part of the domain;
    # www is a subdomain and not needed, but .com is part of the base domain name.
    allowed_domains = ['trustpilot.com']
    start_urls = ['https://www.trustpilot.com/review/www.quickenloans.com/{}']

    # I don't think appending to start_urls is a great way to go about this; I would
    # take this approach instead. start_requests is a built-in Scrapy method which
    # you can override in order to generate your start URLs.
    def start_requests(self):
        for i in range(1, 121):
            link = self.start_urls[0].format(str(i))
            yield scrapy.Request(link, callback=self.parse)

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting product star ratings
        star_rating = data.css('.review-rating')
        # Collecting user reviews
        comments = data.css('.review-text')
        count = 0
        # Combining the results
        for review in star_rating:
            yield {'stars': ''.join(review.xpath('.//text()').extract()),
                   'comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
[1]: https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
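If you want to test the spider quickly without running it through the Scrapy CLI, one option is a minimal sketch using Scrapy's CrawlerProcess (the file name and user agent below are just example values):
# Minimal sketch: run the spider from a plain Python script instead of `scrapy crawl`.
# Assumes QuuickSpider is defined in the same file or imported from your project.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"reviews.json": {"format": "json"}},  # write yielded items to reviews.json
    "USER_AGENT": "Mozilla/5.0",  # example user agent; adjust to your needs
})
process.crawl(QuuickSpider)
process.start()  # blocks until the crawl finishes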
Upvotes: 1
Reputation: 3561
This code produces invalid links:
# Creating list of urls to be scraped by appending page number at the end of base url
for i in range(1, 121):
    start_urls.append(myBaseUrl + str(i))
The result of myBaseUrl + str(i) will be URLs like these (without the / symbol):
https://www.trustpilot.com/review/www.quickenloans.com1
https://www.trustpilot.com/review/www.quickenloans.com2
https://www.trustpilot.com/review/www.quickenloans.com3
If you expect to see links like this:
https://www.trustpilot.com/review/www.quickenloans.com/1
https://www.trustpilot.com/review/www.quickenloans.com/2
https://www.trustpilot.com/review/www.quickenloans.com/3
then for valid links you need to replace myBaseUrl + str(i) with myBaseUrl + "/" + str(i).
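Equivalently, the whole loop can be written as a single list comprehension with the separator included (a small sketch of the corrected code):
# Corrected URL list: note the "/" between the base URL and the page number
myBaseUrl = "https://www.trustpilot.com/review/www.quickenloans.com"
start_urls = [myBaseUrl + "/" + str(i) for i in range(1, 121)]
print(start_urls[0])  # https://www.trustpilot.com/review/www.quickenloans.com/1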
Upvotes: 1