user11642562

Reputation:

Adding headers to Scrapy?

I have the following web-scraping code written in Python with Scrapy:

# -*- coding: utf-8 -*-
import scrapy
from time import sleep


class HousesearchspiderSpider(scrapy.Spider):
    name = "housesearchspider"
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    download_delay = 10.0
    start_urls = [
        'https://www.website.com/filter1/filter2/',
    ]

    def parse(self, response):
        for detail in response.css('div.search-result-content'):
            yield {
                'price': detail.css('div.search-result-info.search-result-info-price ::text').get(),
                'size': detail.css('ul.search-result-kenmerken ::text').get(),
                'postcode': detail.css('small.search-result-subtitle ::text').get(),
                'street': detail.css('h2.search-result-title ::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()

        if next_page is not None:
            next_page = response.urljoin(next_page)
            sleep(5)
            yield scrapy.Request(next_page, callback=self.parse)

but I am getting blocked with that user_agent, and I would like to add headers and a yield scrapy.Request(url, headers=headers) call to emulate exactly the same request a real browser would make (similar to what the following requests call in my BeautifulSoup code does, but in Scrapy):

from requests import get

response = get(url, headers=headers)

I can't find much documentation or any examples showing where exactly to include these headers in Scrapy. Can someone help?

Upvotes: 0

Views: 2263

Answers (2)

Ryan Gedwill

Reputation: 118

scrapy.Request now includes a cookies parameter. Don't put cookies in headers, because they won't get picked up by the cookies middleware.

from scrapy import Request

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
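Putting it together, here is a minimal sketch (the URL, the Referer header, and the parse_next callback are placeholders for illustration) of a request that sends extra headers through headers= while letting the cookies middleware handle the cookies passed through cookies=:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Extra headers go in headers=; cookies go in cookies= so the
        # cookies middleware can track them across requests.
        yield scrapy.Request(
            "http://www.example.com/next",
            headers={"Referer": response.url},
            cookies={"currency": "USD", "country": "UY"},
            callback=self.parse_next,
        )

    def parse_next(self, response):
        yield {"title": response.css("title::text").get()}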

Upvotes: 0

gangabass

Reputation: 10666

For your start_urls requests you can set USER_AGENT and DEFAULT_REQUEST_HEADERS in settings.py.
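For example, a sketch of the relevant settings.py entries (the header names and values below are placeholders; copy the real ones your browser sends from its developer tools):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

# Merged into every request that doesn't override them
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}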

For each request you yield from your code, you can use the headers keyword argument:

yield scrapy.Request(next_page, headers=your_headers, callback=self.parse)
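For instance, a minimal sketch of such a headers dict (the values are placeholders modeled on a typical browser request; grab the exact ones from your browser's network tab):

your_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.website.com/',
}

yield scrapy.Request(next_page, headers=your_headers, callback=self.parse)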

Upvotes: 0
