Reputation: 1183
I'm having trouble scraping a password-protected website. I know there are plenty of similar questions out there, but none of them solved my problem.
The problem is that I don't know what the problem is. I do get a 200 response from their server, but it is not the content I'm expecting. It is a big HTML document, but it contains words like "access", "RequestURLDenied", "Password", "Help", and "Sign in", which suggests my login attempt did not work. I don't know what to change, though. Does anyone have experience with this kind of scraping?
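Since the status code is 200 either way, one way to make a failed login visible is to check the response body for sign-in markers. The sketch below is only a heuristic: the marker strings are the ones quoted above, and the function name `looks_like_login_page` and the `min_hits` threshold are made up for illustration.

```python
# Marker strings taken from the response described above; treat them as a
# starting guess, not an exhaustive list for this site.
LOGIN_MARKERS = ("RequestURLDenied", "Sign in", "Password")


def looks_like_login_page(body, markers=LOGIN_MARKERS, min_hits=2):
    """Return True if the HTML text contains at least `min_hits` markers.

    The comparison is case-insensitive. `body` must be a str, so pass
    response.text rather than response.content.
    """
    text = body.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits >= min_hits
```

With the script below, this could be called as `looks_like_login_page(result.text)` right after the final `session.get(URL)`.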
This is my code so far (extracted from here):
import requests
from lxml import html

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
LOGIN_URL2 = "https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    # Create session
    session = requests.session()
    # Get login cookies
    session.get(LOGIN_URL)
    # Create payload - used to log into password protected area
    login_data = {
        "rmtoken": "dummy",
        "request_id": "null",
        "OAM_REQ": "null",
        "userid": USERNAME,
        "password": PASSWORD,
        "rmflag": "0",
        "aci": "nu"
    }
    # Perform login
    session.post(LOGIN_URL, data = login_data)
    # Scrape url
    result = session.get(URL)
    # Content
    print(result.content)

if __name__ == '__main__':
    main()
Here is what the response looks like when I run the script:
Another question: say I get to the point where I can log in from code and I make a couple of thousand requests to extract text. Could this cause problems for their server? :D
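Whether a few thousand requests is a problem depends on the site's own limits, which are not stated here. A common precaution is to pause between requests. The sketch below assumes a fixed delay is acceptable; `fetch_politely`, its parameters, and the one-second default are illustrative choices, not values known to suit this server.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls.

    `fetch` is any callable taking a URL, for example a logged-in
    session's `get` method. `sleep` is injectable so the pause can be
    replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With a `requests` session this might be used as `pages = fetch_politely(document_urls, session.get, delay_seconds=2.0)`, where `document_urls` is a hypothetical list of document URLs.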
Upvotes: 1
Views: 876
Reputation: 4783
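All in all, your code looks correct; you just made a few mistakes with the URL you are sending the POST request to, and you're using an incomplete payload. The login POST has to go to the Transition URL (the one you defined as LOGIN_URL2 but never used), not back to the sign-in page.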
Try the following code:
import requests
from lxml import html
from lxml.etree import tostring

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    session_requests = requests.session()
    # Get login cookies
    session_requests.get(LOGIN_URL)
    # Create payload - used to log into password protected area
    payload = {
        "rmtoken": "dummy",
        "request_id": "null",
        "OAM_REQ": "null",
        "userid": USERNAME,
        "password": PASSWORD,
        "rmflag": "0",
        "aci": "nu"
    }
    # Perform login: POST to the Transition URL, not to LOGIN_URL
    result = session_requests.post("https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu", data = payload)
    # Scrape url
    result = session_requests.get(URL)
    # Parse the returned HTML and print the serialized tree
    tree = html.fromstring(result.content)
    # bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")
    print(tostring(tree))

if __name__ == '__main__':
    main()
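If the login still fails, it can help to read the POST target and the field names from the sign-in form itself rather than hard-coding them. Here is a minimal sketch using only the standard library. It assumes the form is present in the static HTML, which may not hold if the page builds it with JavaScript; `LoginFormParser` and `extract_login_form` are hypothetical helper names.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action URL and named input defaults of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute default to an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            # Stop collecting after the first form closes.
            self._in_form = False
            self._done = True


def extract_login_form(page_html):
    """Return (action, fields) for the first form found in page_html."""
    parser = LoginFormParser()
    parser.feed(page_html)
    return parser.action, parser.fields
```

For example, `action, fields = extract_login_form(session_requests.get(LOGIN_URL).text)` would give a starting payload to which `userid` and `password` can be added. A relative `action` would still need to be resolved against the page URL, for instance with `urllib.parse.urljoin`.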
Hope this helps
Upvotes: 1