Jack

Reputation: 461

How to scrape a page that first redirects to a login page

I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms. When the link is loaded through WebDriver, it automatically redirects to a login page. After I log in, the browser goes straight to the page I want to scrape, but Beautiful Soup just keeps scraping the login page.

How do I make Beautiful Soup scrape the page I want and not the login page?

I have already tried putting a time.sleep() before the scraping step to give myself time to log in, but that didn't work either.

import time
import requests
from bs4 import BeautifulSoup

url = "https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)

Upvotes: 2

Views: 3411

Answers (2)

shivam singh

Reputation: 71

What you could do is use Selenium. Call browser.get("website.you.need"), which will take you to the login page, and log in manually once. Then loop over the links you need to scrape from the same website within the same program, so that the browser is never closed and you do not lose the session. As long as the program is running, you can access every link you need.

Your code might look like this.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Selenium 4 style; Selenium 3 accepted the driver path as the first positional argument.
browser = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"))
browser.get("https://abc.com/page=1")
# This link redirects to the login page. Enter your credentials manually
# and wait for the login to succeed; 30 seconds should be enough.
time.sleep(30)

links = ["https://abc.com/page=1", "https://abc.com/page=2"]

for link in links:
    browser.get(link)
    # No login is needed here because the browser has not been closed.
    time.sleep(5)
    html = browser.page_source
    # Do your scraping here, or save the HTML source somewhere and scrape it later.

# End the browser session once every link has been visited.
browser.quit()
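
As an alternative to the fixed 30-second sleep, the script could poll the browser's URL until it has left the login page. This is only a sketch: the helper name wait_until_logged_in is hypothetical, and it assumes the login page URL contains the text "login", which you would need to check for your site. It takes a plain callable, so it does not depend on Selenium itself.

```python
import time


def wait_until_logged_in(get_current_url, login_marker="login",
                         timeout=120.0, poll_interval=1.0):
    """Poll until the current URL no longer contains login_marker.

    get_current_url is a zero-argument callable returning the browser's
    current URL, e.g. ``lambda: browser.current_url`` with Selenium.
    Returns True once the login page has been left, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if login_marker not in get_current_url():
            return True
        time.sleep(poll_interval)
    return False


# Possible use with the Selenium script (assumption: "login" appears in the login URL):
#   browser.get(links[0])
#   if not wait_until_logged_in(lambda: browser.current_url):
#       raise RuntimeError("Login was not completed in time")
```

If the site redirects with JavaScript some time after the page loads, the first check could pass too early, so a short delay before polling may still be needed.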

Upvotes: 0

rrcal

Reputation: 3752

What you probably need is a persistent session with requests. This answer probably covers exactly what you need. The general idea is simple:

  1. You open a session and send a request to the website
  2. Send the login post request so it logs you in
  3. Query the url with the same session.

You will need to understand how the login POST request is structured and what data is passed (username, email, etc.), and then build a dictionary with that data.

import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.Session()

# Fill in the values your login form expects; the field names depend on the site.
login_data = {
    'username': '',             # your username
    'csrfmiddlewaretoken': '',  # CSRF token issued by the site
    'password': '',             # your password
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will be redirected and might set some initial cookies

r = session.post('https://<theurl>/login.py', data=login_data)

if r.status_code == 200:  # the request was accepted (this alone does not prove the login succeeded)
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
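
The csrfmiddlewaretoken value is often rendered as a hidden input in the login form, so one way to find it is to collect the hidden fields from the login page's HTML. The sketch below assumes that is the case for your site (some sites deliver the token only through a cookie or JavaScript). The names HiddenInputParser and hidden_form_fields are made up for illustration, and it uses the standard library's html.parser only to stay dependency-free; Beautiful Soup would work just as well.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_form_fields(html):
    """Return a dict of the hidden input fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


# Possible use with the session above (login_url is a placeholder you must supply):
#   login_page = session.get(login_url)
#   login_data.update(hidden_form_fields(login_page.text))
```

Merging the hidden fields into login_data before the POST means the token always comes from the same session that sends the login request.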

Upvotes: 1
