kyrenia

Reputation: 5575

Change user agent used with robotparser in Python

I am using the robotparser from the urllib module in Python to determine if I can download webpages. One site I am accessing, however, returns a 403 error when the robots.txt file is accessed via the default user-agent, but the correct response if it is downloaded, e.g., via requests with my own user-agent string. (The site also gives a 403 when accessed with the requests package's default user-agent, suggesting they are just blocking common/generic user-agent strings, rather than adding them to the robots.txt file.)

Anyway, is it possible to change the user-agent used by the robotparser module? Or alternatively, to load in a robots.txt file downloaded separately?
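
For reference, this is roughly how the difference shows up (the URL is a placeholder; the real site behaves as described):

import requests

url = 'http://example.com/robots.txt'  # placeholder for the site in question

# 403 with requests' default user-agent (python-requests/x.y)
print(requests.get(url).status_code)

# 200 with a browser-like user-agent string
print(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).status_code)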

Upvotes: 1

Views: 910

Answers (1)

Daniil Ryzhkov

Reputation: 7596

There is no option to fetch robots.txt with a custom User-Agent using RobotFileParser, but you can fetch it yourself and pass a list of strings to the parse() method:

from urllib.robotparser import RobotFileParser
import urllib.request

rp = RobotFileParser()

# Fetch robots.txt with an explicit User-Agent header, then feed the
# lines to the parser instead of calling rp.read()
with urllib.request.urlopen(urllib.request.Request('http://stackoverflow.com/robots.txt',
                                                   headers={'User-Agent': 'Python'})) as response:
    rp.parse(response.read().decode("utf-8").splitlines())

print(rp.can_fetch("*", "http://stackoverflow.com/posts/"))
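
As for changing the user-agent robotparser itself uses: RobotFileParser.read() fetches the file through urllib.request.urlopen(), so installing a global opener that carries your own User-Agent header should also work. A minimal sketch, assuming a placeholder agent string 'MyCrawler/1.0':

import urllib.request
from urllib.robotparser import RobotFileParser

# Install a global opener whose requests carry a custom User-Agent;
# urlopen() (and therefore rp.read()) will pick it up.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyCrawler/1.0')]  # placeholder UA string
urllib.request.install_opener(opener)

rp = RobotFileParser()
rp.set_url('http://stackoverflow.com/robots.txt')
rp.read()

print(rp.can_fetch('MyCrawler/1.0', 'http://stackoverflow.com/posts/'))

The downside is that the opener is process-wide and affects every other urlopen() call, so the explicit Request approach above is cleaner if you only need it for robots.txt.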

Upvotes: 7
