Reputation: 95
I would like to crawl some data from a website. To access the target data manually, I need to log in and then click some buttons to finally reach the target HTML page. Currently, I am using the Python requests
library to simulate this process. I am doing it like this:
import requests

ss = requests.Session()
# log in
resp = ss.post(url, data={'username': 'xxx', 'password': 'xxx'})
# then send a request to the target url using the same session
result = ss.get(target_url)
However, I found that the final request did not return what I wanted.
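For what it's worth, I have been sanity-checking the login response with a few generic checks like these (nothing here is site-specific; the 'logout' marker is just a placeholder for whatever the site shows only when logged in):
# rough sanity checks on the login POST (sketch, not site-specific)
print(resp.status_code)           # 200 does not guarantee success; many sites return 200 with an error page
print(resp.url)                   # after redirects, a failed login often lands back on the login page
print(ss.cookies.get_dict())      # a session cookie usually appears here after a successful login
if 'logout' in resp.text.lower(): # placeholder marker for a logged-in page
    print('login looks successful')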
So I changed my approach. I captured all the network traffic and looked into the headers and cookies of the last request. I found that some values differ in each login session, such as the session id
and a few other variables. So I traced back to where these variables are returned in the responses and obtained their values again by sending the corresponding requests. After this, I constructed the correct headers and cookies and sent the request like this:
resp = ss.get(target_url, headers=myheader, cookies=mycookie)
But it still does not return anything. Can anyone help?
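For context, the token-tracing step I described looks roughly like this (just a sketch; the csrftoken field name and the URLs are placeholders for whatever the site actually uses):
import re
import requests

login_url = 'https://example.com/login'    # placeholder
target_url = 'https://example.com/target'  # placeholder

ss = requests.Session()
# fetch the login page first so the server sets its initial cookies
login_page = ss.get(login_url)
# pull the per-session token out of the login form (field name is a placeholder)
token_match = re.search(r'name="csrftoken" value="([^"]+)"', login_page.text)
token = token_match.group(1) if token_match else ''
# include the token alongside the credentials when logging in
resp = ss.post(login_url, data={'username': 'xxx', 'password': 'xxx', 'csrftoken': token})
# the session now carries the cookies, so explicit headers/cookies are often unnecessary
result = ss.get(target_url)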
Upvotes: 0
Views: 1454
Reputation: 27611
I was in the same boat some time ago, and I eventually switched from trying to get requests to work to using Selenium instead, which made life much easier (pip install selenium). Then you can log into a website and navigate to the desired page like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

website_with_logins = "https://website.com"
website_to_access_after_login = "https://website.com/page"

# start a browser (use the driver for whichever browser you have installed)
driver = webdriver.Chrome()

# open the login page and fill in the credentials
driver.get(website_with_logins)
username = driver.find_element(By.NAME, "username")
username.send_keys("your_username")
password = driver.find_element(By.NAME, "password")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)

# now navigate to the page you actually want
driver.get(website_to_access_after_login)
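If the post-login page takes a moment to render, an explicit wait is more reliable than watching the window. Continuing from the snippet above, here's a minimal sketch that waits for some element you know appears on the target page (the "content" name is just a placeholder):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for a known element of the target page to show up
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "content"))  # placeholder locator
)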
Once you have the website_to_access_after_login loaded (you'll see it appear), you can get the HTML and have a field day using just
html = driver.page_source
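From there you can feed the HTML into whatever parser you like; for example, a small sketch with BeautifulSoup (assuming pip install beautifulsoup4; the table id below is hypothetical):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# pull out whatever you need, e.g. a table by id (the id here is hypothetical)
table = soup.find("table", id="results")
print(table.get_text(strip=True) if table else "table not found")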
Hope this helps.
Upvotes: 1