Reputation: 5575
I am using the robotparser
from the urllib module in Python to determine whether I can download webpages. One site I am accessing, however, returns a 403 error when the robots.txt file is accessed via the default user-agent, but the correct response if it is downloaded via, e.g., requests with my own user-agent string. (The site also gives a 403 when accessed with the requests package's default user-agent, suggesting they are simply blocking common/generic user-agent strings rather than adding them to the robots.txt file.)
Anyway, is it possible to change the user-agent in the robotparser module? Or alternatively, to load in a robots.txt file that was downloaded separately?
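For reference, this is roughly the check I mean (a minimal sketch; the URL and user-agent string are placeholders):

import urllib.robotparser
import requests

# Placeholder URL standing in for the site in question
ROBOTS_URL = "http://example.com/robots.txt"

# RobotFileParser.read() fetches robots.txt with urllib's default
# user-agent, which this site answers with a 403
rp = urllib.robotparser.RobotFileParser(ROBOTS_URL)
rp.read()

# The same file fetched via requests with my own user-agent comes back fine
resp = requests.get(ROBOTS_URL, headers={"User-Agent": "MyCrawler/1.0"})
print(resp.status_code)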
Upvotes: 1
Views: 910
Reputation: 7596
There is no option to set the User-Agent that RobotFileParser uses when fetching robots.txt, but you can fetch the file yourself and pass a list of strings to its parse() method:
from urllib.robotparser import RobotFileParser
import urllib.request

rp = RobotFileParser()
# Download robots.txt ourselves so we can send a custom User-Agent header
request = urllib.request.Request('http://stackoverflow.com/robots.txt',
                                 headers={'User-Agent': 'Python'})
with urllib.request.urlopen(request) as response:
    # Feed the downloaded lines to the parser
    rp.parse(response.read().decode("utf-8").splitlines())

print(rp.can_fetch("*", "http://stackoverflow.com/posts/"))
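Since you already fetch pages with requests, the same approach works there as well; this is just a sketch along the same lines, with a placeholder user-agent string:

from urllib.robotparser import RobotFileParser
import requests

rp = RobotFileParser()
# Placeholder user-agent; substitute whatever string the site accepts
resp = requests.get('http://stackoverflow.com/robots.txt',
                    headers={'User-Agent': 'MyCrawler/1.0'})
rp.parse(resp.text.splitlines())
print(rp.can_fetch("*", "http://stackoverflow.com/posts/"))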
Upvotes: 7