Ishmael

Reputation: 13

Error while obtaining start requests with Scrapy

I am having some trouble trying to scrape these two specific pages and don't really see where the problem is. If you have any ideas or advice, I'm all ears! Thanks in advance!

import scrapy


class SneakersSpider(scrapy.Spider):
    name = "sneakers"
    
    def start_requests(self):
        headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
        urls = [ 
            #"https://stockx.com/fr-fr/retro-jordans",
            "https://stockx.com/fr-fr/retro-jordans?page=2",
            "https://stockx.com/fr-fr/retro-jordans?page=3",
            ]
        for url in urls:
            yield scrapy.Request(url = url, callback =self.parse ,headers = headers)
            
    def parse(self,response):
        page = response.url.split("=")[-1]
        filename = f'sneakers-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
            
            
            


Upvotes: 1

Views: 1050

Answers (1)

stranac

Reputation: 28236

Looking at the traceback always helps. You should see something like this in your spider's output:

Traceback (most recent call last):
  File "c:\program files\python37\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "D:\Users\Ivan\Documents\Python\a.py", line 15, in start_requests
    yield scrapy.Request(url = url, callback =self.parse ,headers = headers)
  File "c:\program files\python37\lib\site-packages\scrapy\http\request\__init__.py", line 39, in __init__
    self.headers = Headers(headers or {}, encoding=encoding)
  File "c:\program files\python37\lib\site-packages\scrapy\http\headers.py", line 12, in __init__
    super(Headers, self).__init__(seq)
  File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 193, in __init__
    self.update(seq)
  File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 229, in update
    super(CaselessDict, self).update(iseq)
  File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 228, in <genexpr>
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack (expected 2)

As you can see, there is a problem in the code that handles request headers.

headers is a set in your code; it should be a dict instead.
This works without a problem:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
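The two are easy to confuse, since both literals use curly braces. Here is a standalone illustration (plain Python, no Scrapy needed; the shortened user-agent string is just a stand-in) of why the set version produces exactly the error in the traceback:

```python
# Curly braces without a colon create a set, not a dict.
bad_headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
good_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}

print(type(bad_headers))   # <class 'set'>
print(type(good_headers))  # <class 'dict'>

# Scrapy walks the headers as (key, value) pairs. With a dict, that works:
for key, value in good_headers.items():
    print(key, value)

# Iterating the set yields whole strings, and unpacking one long string
# into two names raises the ValueError shown in the traceback:
try:
    for key, value in bad_headers:
        pass
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)
```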

Another way to set a default user agent for all requests is using the USER_AGENT setting.
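A minimal sketch of that alternative, assuming a standard Scrapy project layout:

```python
# settings.py -- this applies to every request the project makes,
# so start_requests no longer needs to pass a headers argument.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')
```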

Upvotes: 1
