Changing Scrapy/Splash user agent

How can I set the user agent for Scrapy with Splash, in a way equivalent to the requests code below?

import requests
from bs4 import BeautifulSoup

ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")

My spider would look similar to this:

import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )

Upvotes: 6

Answers (3)

Tranfer Will

Reputation: 178

If you use Splash directly (without the scrapy-splash package), you can pass a headers parameter containing a 'User-Agent' key. Splash then uses that user agent for all requests it makes while rendering the page.

https://splash.readthedocs.io/en/stable/api.html?highlight=User-Agent

Here is an example:

import requests

# Placeholder: replace with the page you want to render.
your_aim_url = 'http://www.example.com'

headers = {
    'User-Agent': 'Mozilla/5.0',
}
param = {
    'url': your_aim_url,
    'headers': headers,
    'html': 1,
    'har': 1,
    'response_body': 1,
}
session = requests.Session()
session.headers.update({'Content-Type': 'application/json'})
# POST the rendering options as JSON to the local Splash instance.
response = session.post(url='http://127.0.0.1:8050/render.json', json=param)
response_json = response.json()
print(response_json.get('html'))  # page HTML
print(response_json.get('har'))  # HAR with response bodies; set 'response_body' to 0 to omit them

You can check the request headers in the HAR output to confirm that the user agent was applied.
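As a rough sketch of that check, the helper below walks a HAR dictionary and collects the User-Agent value of every recorded request. The function name `user_agents_from_har` and the sample data are made up for illustration; the layout (`log` → `entries` → `request` → `headers`, each header a name/value pair) follows the standard HAR format, which is what Splash is assumed to return here.

```python
def user_agents_from_har(har):
    """Collect the User-Agent value of every request recorded in a HAR dict."""
    agents = []
    for entry in har.get("log", {}).get("entries", []):
        for header in entry.get("request", {}).get("headers", []):
            # Header names are case-insensitive, so compare in lowercase.
            if header.get("name", "").lower() == "user-agent":
                agents.append(header.get("value"))
    return agents


# Hypothetical HAR fragment with two requests, for illustration only.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://www.example.com/",
                         "headers": [{"name": "User-Agent", "value": "Mozilla/5.0"}]}},
            {"request": {"url": "http://www.example.com/style.css",
                         "headers": [{"name": "Accept", "value": "text/css"},
                                     {"name": "user-agent", "value": "Mozilla/5.0"}]}},
        ]
    }
}

print(user_agents_from_har(sample_har))  # ['Mozilla/5.0', 'Mozilla/5.0']
```

With the snippet above, this would be called as `user_agents_from_har(response_json.get('har'))`.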

Upvotes: 0

scriptso

Reputation: 677

The proper way is to alter the Splash script to include it, rather than adding it to the spider, although that works as well.

http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
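A minimal sketch of what such a script might look like, assuming the Splash `execute` endpoint is used. `splash:set_user_agent` comes from the Splash scripting reference linked above; the names `LUA_SET_USER_AGENT` and `build_execute_args` and the `user_agent` argument key are made up for this example.

```python
# Lua script for Splash's "execute" endpoint. It sets the user agent before
# loading the page. args.user_agent is a custom argument (an assumption of
# this sketch) that the caller must supply alongside the script.
LUA_SET_USER_AGENT = """
function main(splash, args)
  splash:set_user_agent(args.user_agent)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def build_execute_args(user_agent, wait=0.5):
    """Build the args dict to send with the script (hypothetical helper)."""
    return {
        "lua_source": LUA_SET_USER_AGENT,
        "user_agent": user_agent,
        "wait": wait,
    }


splash_args = build_execute_args("Mozilla/5.0")
```

In the spider from the question, these arguments could then be passed as `SplashRequest(url, self.parse, endpoint='execute', args=splash_args)`. scrapy-splash should fill in `args.url` from the request URL, and because the script returns a table, the rendered page should be available as `response.data['html']` in the callback.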

Upvotes: 7

skovorodkin

Reputation: 10264

Set the user_agent attribute on the spider to override the default user agent:

class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'

In this case UserAgentMiddleware (which is enabled by default) uses 'Mozilla/5.0' instead of the value of the USER_AGENT setting.

You can also override headers per request:

scrapy_splash.SplashRequest(url, headers={'User-Agent': custom_user_agent})

Upvotes: 8
