C.Otteaga

Reputation: 23

Scrape password protected website with no token

(Sorry for my English; I'll do my best.)

I'm new to Python and looking for help with web scraping. I already have working code that extracts the links I want, but the website is password-protected. With the help of many questions I read here, I managed to write code that scrapes the site after logging in, but the links I want are on another page:

The login page is http://fantasy.trashtalk.co/login.php

The landing page after login (the one this code scrapes) is http://fantasy.trashtalk.co/

The page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1

So I have this code (some imports are probably unnecessary; they come from another script):

from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"

values = {'email': username,
          'password': password}

r = requests.post(log, data=values)

# Not sure about the code below but it works.
data = r.text

soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')

for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))

I understand that this code only logs in and then scrapes what comes next (the landing page), but I can't figure out how to "save" my login info so that I can access the page I want to scrape.

I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page:

s = request.get(url)

I have also read some topics here that use a "with session" construct, but I didn't manage to make it work.

Any help would be appreciated. Thank you for your time.

Upvotes: 2

Views: 3809

Answers (1)

Dascienz

Reputation: 1071

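The issue is that the login has to be posted through a session object rather than a standalone requests.post call. A session keeps the cookies the server sets at login and sends them with every later request, whereas a separate requests.get starts without them. I've modified your code below; it gives you access to the HTML tags of the scrape_url page. Good luck!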

import requests
from bs4 import BeautifulSoup

username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'

login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username, 'password': password}

# Start a session so that cookies set at login are reused by later requests.
session = requests.Session()

# Log in using your authentication information.
session.post(url=login_url, data=login_info)

# Request the page you want to scrape.
response = session.get(url=scrape_url)

soup = BeautifulSoup(response.content, 'html.parser')

# href=True skips <a> tags without an href, which would otherwise raise a KeyError.
for link in soup.find_all('a', href=True):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
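
If it helps, here is a minimal sketch of the same login-then-fetch flow wrapped in a function that also checks whether the login worked. The name fetch_after_login is hypothetical, and the check assumes the site sends you back to the login page when the login is rejected, which I have not verified for this site. The function accepts any object that behaves like requests.Session:

```python
def fetch_after_login(session, login_url, page_url, credentials):
    """Log in through `session`, then return the HTML of `page_url`.

    `session` is expected to behave like requests.Session: post() and get()
    return objects that provide raise_for_status(), .url and .text.
    """
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()  # raise on HTTP 4xx/5xx responses

    # Same session object, so the login cookies are sent along here.
    page_response = session.get(page_url)
    page_response.raise_for_status()

    # Assumption (not verified for this site): a rejected login ends up
    # back on the login page, so the final URL reveals the failure.
    if page_response.url.startswith(login_url):
        raise RuntimeError("still on the login page; check the credentials")
    return page_response.text


# Usage sketch (requires the requests package; names match the code above):
#
#     import requests
#     with requests.Session() as session:
#         html = fetch_after_login(session, login_url, scrape_url, login_info)
```

The "with" form mentioned in the question only closes the session when the block ends; the cookies are kept by the session object either way.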

Upvotes: 3
