Reputation: 428
I am trying to use Scrapy in a project, and I am having trouble getting past the authentication system of https://text.westlaw.com/signon/default.wl?RS=ACCS10.10&VR=2.0&newdoor=true&sotype=mup . To understand the issue, I wrote a simple request handler:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'),]
url='https://text.westlaw.com/signon/default.wl?RS=ACCS10.10&VR=2.0&newdoor=true&sotype=mup'
r = opener.open(url)
f = open('code.html', 'wb')
f.write(r.read())
f.close()
The returned HTML contains no form elements. Does anyone know how to convince the server that I am not a fake browser, so that I can proceed with authentication?
Upvotes: 0
Views: 1529
Reputation: 453
You can use InitSpider, which lets you run setup steps, such as logging in with a custom handler, before the crawl starts:
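import logging

from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider


class CrawlpySpider(InitSpider):
    #...

    # Make sure to add the logout page to the denied list.
    # `self` does not exist in the class body, so the domain is given directly.
    # Note: InitSpider derives from Spider rather than CrawlSpider, so these
    # rules are not followed automatically; parse() has to apply them.
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=('domain.tld',),
                unique=True,
                deny=(r'logout\.php',),
            ),
            callback='parse',
            follow=True
        ),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        # Fetch the login page first so the form can be filled in
        return Request(url="http://domain.tld/login.php", callback=self.login)

    def login(self, response):
        """Generate a login request."""
        # from_response pre-fills the form found in the page (hidden fields
        # included) and overrides the fields listed in formdata
        return FormRequest.from_response(
            response,
            formdata={
                "username": "admin",
                "password": "very-secure",
                "required-field": "my-value"
            },
            method="post",
            callback=self.check_login_response
        )

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        # response.text is the decoded body (response.body is bytes)
        if "incorrect password" not in response.text:
            # Now the crawling can begin..
            logging.info('Login successful')
            return self.initialized()
        else:
            # Something went wrong, we couldn't log in, so nothing happens.
            logging.error('Unable to login')

    def parse(self, response):
        """Your stuff here"""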
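The check in check_login_response only looks for one failure message. As a rough sketch, the same test can be pulled out into a small helper that is easy to try without running a spider. The name login_succeeded and the failure_markers parameter are made up for this illustration, and the marker text is whatever your site actually prints on a failed login:

```python
def login_succeeded(page_text, failure_markers=("incorrect password",)):
    """Return True when none of the failure markers appear in the page.

    Hypothetical helper: the markers are assumptions about what the
    target site prints after a failed login.
    """
    lowered = page_text.lower()
    # Case-insensitive substring check against every known failure message
    return not any(marker.lower() in lowered for marker in failure_markers)


if __name__ == "__main__":
    print(login_succeeded("Welcome back, admin"))
    print(login_succeeded("Incorrect password, please try again"))
```

Inside the spider you would call it as login_succeeded(response.text). Checking for the absence of an error message is a weak signal; if the site shows something only logged-in users see (a logout link, for instance), testing for that is probably more reliable.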
I have also just implemented a working example, which does exactly what you are trying to achieve. Have a look at it: https://github.com/cytopia/crawlpy
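Since FormRequest.from_response needs a <form> in the response to work with, it may be worth checking what the fetched login page really contains before debugging the spider. Below is a minimal sketch for Python 3 using only the standard library; FormCounter and summarize_forms are names invented here, and the sample HTML is made up. You would feed it the contents of the code.html file saved in the question instead:

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Count <form> tags and collect the names of <input> fields."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs
        if tag == "form":
            self.form_count += 1
        elif tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.input_names.append(name)


def summarize_forms(html_text):
    """Return (number of forms, list of input names) for an HTML string."""
    parser = FormCounter()
    parser.feed(html_text)
    parser.close()
    return parser.form_count, parser.input_names


if __name__ == "__main__":
    # Made-up sample page standing in for the saved login page
    sample = (
        '<html><body><form action="/login" method="post">'
        '<input name="username"><input name="password" type="password">'
        '</form></body></html>'
    )
    print(summarize_forms(sample))  # (1, ['username', 'password'])
```

If this reports zero forms for the page you downloaded, the form is most likely built by JavaScript or served only after a redirect or cookie check, and the login request would then have to be constructed by hand rather than with from_response.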
Upvotes: 2