Reputation: 461
I am trying to scrape some text off of https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but as you can see, when the link is loaded through the web driver it automatically redirects to a login page. After I log in, it goes straight to the page I want to scrape, yet BeautifulSoup keeps scraping the login page instead.
How do I make BeautifulSoup scrape the page I want and not the login page?
I have already tried putting a time.sleep() before the scrape to give myself time to log in, but that didn't work either.
import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')

while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)
Upvotes: 2
Views: 3411
Reputation: 71
What you could do is use Selenium. Simply call browser.get("website.you.need"); this will take you to the login page. Log in manually once. Then loop over the links you need to scrape from the same website in the same program, so the browser does not get closed and you do not lose the session. As long as the program keeps running, you can access the links you want.
Your code might look like this.
import time

from selenium import webdriver

browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")

# This first request will redirect you to the login page.
# Enter your credentials manually and wait for the login to finish;
# 30 seconds should be enough.
browser.get('https://abc.com/page=1')
time.sleep(30)

links = ["https://abc.com/page=1", "https://abc.com/page=2"]
for link in links:
    # No login needed here, because the browser (and its session) is never closed.
    browser.get(link)
    time.sleep(5)
    html = browser.page_source
    # do your scraping here, or save the HTML source somewhere and scrape it later

browser.close()
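To tie this back to the question, here is a minimal sketch of handing the HTML that Selenium rendered (after login) to BeautifulSoup instead of re-fetching the URL with requests, which has no logged-in session. It reuses the html variable from the loop above and the 'qquestion qtext' class from the question's code; the right selector is whatever the target page actually uses.

from bs4 import BeautifulSoup

# Parse the already-rendered page source with BeautifulSoup.
soup = BeautifulSoup(html, 'html.parser')
front_half = soup.find_all(class_='qquestion qtext')
print(front_half)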
Upvotes: 0
Reputation: 3752
What you probably need is a persistent session with requests. This answer probably covers exactly what you need. The general idea is simple: work out how the login POST request is structured and what data it sends (username, password, CSRF token, etc.), then build a payload with that data.
import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.Session()

# Fill these in with your own credentials and the CSRF token from the login page.
login_data = {
    'username': '<your username>',
    'csrfmiddlewaretoken': '<token from the login page>',
    'password': '<your password>',
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will redirect you and it might load some initial cookie info
r = session.post('https://<theurl>/login.py', login_data)

if r.status_code == 200:  # if the request was accepted
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
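As a hedged sketch of the CSRF step: the csrfmiddlewaretoken field suggests a Django-style login, where the token is usually available either as a csrftoken cookie set when you load the login page or as a hidden input in the login form. The /login/ URL and field names below are assumptions; check the real login request in your browser's network tab.

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Load the login page first so the session picks up its cookies.
login_page = session.get('https://www.memrise.com/login/')  # assumed login URL

# Option 1: Django-style sites usually expose the token as a cookie.
csrf_token = session.cookies.get('csrftoken')

# Option 2: fall back to the hidden form field on the login page.
if not csrf_token:
    soup = BeautifulSoup(login_page.text, 'html.parser')
    field = soup.find('input', {'name': 'csrfmiddlewaretoken'})
    csrf_token = field['value'] if field else None

print(csrf_token)  # use this value for 'csrfmiddlewaretoken' in login_data

The token value then goes into login_data before the POST; some Django sites also check the Referer header on HTTPS login requests, so sending one may be necessary.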
Upvotes: 1