Reputation: 108
I've been trying to scrape a website that, like GitHub, requires login authentication, but unlike GitHub, it does not have an API. I've followed these instructions and many others, but nothing seems to work; the request just returns a 422 error.
import requests
from lxml import html
url = "https://github.com/login"
user = "my email"
pas = "associated password"
sess = requests.Session()
r = sess.get(url)
rhtml = html.fromstring(r.text)
#get all hidden input fields and make a dict of them
hidden = rhtml.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]: x.attrib["value"] for x in hidden}
#add login creds to the dict
form['login'] = user
form['password'] = pas
#post
res = sess.post(url, data=form)
print(res)
# <Response [422]>
I've also tried just sess.post(url, data={'login': user, 'password': pas}) with the same result. Getting the cookies first and using them in the POST doesn't seem to work either.
How can I get past the login page, preferably without using Selenium?
Upvotes: 3
Views: 1495
Reputation: 8077
That's because the form action is different from the login page URL. This is how you can do it using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://github.com/login"
user = "<username>"
pwd = "<password>"
with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # collect all hidden input fields (including the CSRF token)
    hidden = soup.find_all("input", {'type': 'hidden'})
    # the form posts to its action attribute, not back to /login
    target = "https://github.com" + soup.find("form")['action']
    payload = {x["name"]: x["value"] for x in hidden}
    # add login creds to the dict
    payload['login'] = user
    payload['password'] = pwd
    r = s.post(target, data=payload)
    print(r)
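To see the difference between the page URL and the POST target without hitting the network, here is a minimal sketch using a simplified stand-in for a login form (the HTML below is illustrative, not GitHub's actual markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a login page: the page lives at /login,
# but the <form> posts to /session, and it carries a hidden CSRF token.
sample = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

soup = BeautifulSoup(sample, "html.parser")
form = soup.find("form")

# The POST must go to the form's action attribute, not the page URL.
target = "https://github.com" + form["action"]
print(target)  # https://github.com/session

# Hidden fields (e.g. the CSRF token) must be echoed back in the payload,
# otherwise the server rejects the POST with a 422.
payload = {i["name"]: i["value"] for i in form.find_all("input", {"type": "hidden"})}
print(payload)  # {'authenticity_token': 'abc123'}
```

Posting to the page URL instead of the action is exactly what produces the 422 in the question: the server at /login doesn't accept the credentials, and the missing/stale token check fails.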
Upvotes: 2