Zin Yosrim

Reputation: 1684

Changing Scrapy/Splash user agent

How can I set the user agent for Scrapy with Splash, in a way equivalent to the requests code below?

import requests
from bs4 import BeautifulSoup

ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.example.com"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")

My spider would look similar to this:

import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 0.5}
            )

Upvotes: 6

Views: 6785

Answers (3)

Tranfer Will

Reputation: 178

If you use plain Splash (without the scrapy-splash package), you can pass a headers parameter containing a 'User-Agent' key. All requests made while rendering the page will then use this user agent.

https://splash.readthedocs.io/en/stable/api.html?highlight=User-Agent

Here is an example:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
}
your_aim_url = 'http://www.example.com'  # placeholder: replace with the page you want to render
param = {
    'url': your_aim_url,
    'headers': headers,
    'html': 1,
    'har': 1,
    'response_body': 1,
}
session = requests.Session()
session.headers.update({'Content-Type': 'application/json'})
# Assumes a Splash instance is listening on 127.0.0.1:8050.
response = session.post(url='http://127.0.0.1:8050/render.json', json=param)
response_json = response.json()
print(response_json.get('html'))  # page HTML
print(response_json.get('har'))  # HAR with response bodies; set 'response_body' to 0 to omit them

You can check the request header in har to see if the user-agent is correct.
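As a rough sketch of that check, the helper below walks a HAR dictionary and collects every User-Agent value it finds in the request headers. The function name user_agents_in_har and the sample_har fragment are made up for illustration; the only assumption about the data is the standard HAR layout (log, then entries, then request, then headers as name/value pairs). With a real response you would pass response_json.get('har') instead.

```python
def user_agents_in_har(har):
    """Return the set of User-Agent values sent across all HAR entries."""
    agents = set()
    for entry in har.get("log", {}).get("entries", []):
        for header in entry.get("request", {}).get("headers", []):
            # Header names are case-insensitive, so compare in lower case.
            if header.get("name", "").lower() == "user-agent":
                agents.add(header.get("value"))
    return agents


# Hypothetical, hand-written HAR fragment standing in for Splash output.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://www.example.com/",
                         "headers": [{"name": "User-Agent", "value": "Mozilla/5.0"},
                                     {"name": "Accept", "value": "*/*"}]}},
            {"request": {"url": "http://www.example.com/style.css",
                         "headers": [{"name": "user-agent", "value": "Mozilla/5.0"}]}},
        ]
    }
}

print(user_agents_in_har(sample_har))
```

If the returned set contains exactly one value and it matches what you passed in headers, every request in the render used your user agent.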

Upvotes: 0

scriptso

Reputation: 677

The proper way is to alter the Splash script to include it, rather than adding it to the spider, though that works as well.


http://splash.readthedocs.io/en/stable/scripting-ref.html?highlight=agent
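A minimal sketch of what such a script could look like, assuming you switch the request to Splash's execute endpoint. The Lua script calls splash:set_user_agent before loading the page. The helper splash_execute_kwargs and the user_agent argument name are hypothetical, chosen for this example; url is read from splash.args on the assumption that scrapy-splash forwards the request URL there.

```python
# Lua script run by Splash's "execute" endpoint.
LUA_SCRIPT = """
function main(splash)
    -- Set the user agent before the first request is made.
    splash:set_user_agent(splash.args.user_agent)
    assert(splash:go(splash.args.url))
    assert(splash:wait(splash.args.wait))
    return {html = splash:html()}
end
"""


def splash_execute_kwargs(user_agent, wait=0.5):
    """Build keyword arguments for a SplashRequest that runs LUA_SCRIPT."""
    return {
        "endpoint": "execute",
        "args": {
            "lua_source": LUA_SCRIPT,
            "user_agent": user_agent,  # read in Lua as splash.args.user_agent
            "wait": wait,
        },
    }


kwargs = splash_execute_kwargs("Mozilla/5.0")
print(kwargs["endpoint"], kwargs["args"]["user_agent"])
```

In the spider from the question, this would be used as yield SplashRequest(url, self.parse, **splash_execute_kwargs("Mozilla/5.0")) in place of the args={'wait': 0.5} call.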

Upvotes: 7

skovorodkin

Reputation: 10264

You need to set the user_agent attribute on the spider to override the default user agent:

class ExampleSpider(scrapy.Spider):
    name = 'example'
    user_agent = 'Mozilla/5.0'

In this case UserAgentMiddleware (which is enabled by default) will use 'Mozilla/5.0' instead of the USER_AGENT setting value.

You can also override headers per request:

scrapy_splash.SplashRequest(url, headers={'User-Agent': custom_user_agent})

Upvotes: 8
