seralouk

Reputation: 33147

How to get h3 tag with class in web scraping Python

I want to scrape the text of an h3 tag with a class, as shown in the attached photo.

I modified the code based on the posted recommendation:

import requests
import urllib

session = requests.session()
session.headers.update({
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
  'Accept': '*/*',
  'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
  'Content-Type': 'application/json',
  'Origin': 'https://auth.fool.com',
  'Connection': 'keep-alive',
})

response1 = session.get("https://www.fool.com/secure/login.aspx")
assert response1

response1.cookies
#<RequestsCookieJar[Cookie(version=0, name='_csrf', value='8PrzU3pSVQ12xoLeq2y7TuE1', port=None, port_specified=False, domain='auth.fool.com', domain_specified=False, domain_initial_dot=False, path='/usernamepassword/login', path_specified=True, secure=True, expires=1609597114, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>

params = urllib.parse.parse_qs(response1.url)
params

payload = {
    "client_id": params["client"][0],
    "redirect_uri": "https://www.fool.com/premium/auth/callback/",
    "tenant": "fool",
    "response_type": "code",
    "scope": "openid email profile",
    "state": params["https://auth.fool.com/login?state"][0],
    "_intstate": "deprecated",
    "nonce": params["nonce"][0],
    "password": "XXX",
    "connection": "TMF-Reg-API",
    "username": "XXX",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"



url = "https://auth.fool.com/usernamepassword/login"
response2 = session.post(url, data=formatted_payload)

response2.cookies
#<RequestsCookieJar[]>

response2.cookies is empty, so it seems that the login fails.

Upvotes: 0

Views: 705

Answers (1)

Gregor

Reputation: 682

I can only give you some partial advice, but you might be able to find the "last missing piece" yourself (I have no access to the premium content of your target page). It's correct that you need to log in first in order to get the content.

What's usually useful is a session that handles cookies for you. Also, proper headers often do the trick:

import requests
import urllib

session = requests.session()
session.headers.update({
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
  'Accept': '*/*',
  'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
  'Content-Type': 'application/json',
  'Origin': 'https://auth.fool.com',
  'Connection': 'keep-alive',
})

Next we get some cookies for our session from the "official" login page:

response = session.get("https://www.fool.com/secure/login.aspx")
assert response

We will use some parameters from the response URL (yes, there are a couple of redirects) to build a valid payload for the actual login:

params = urllib.parse.parse_qs(response.url)
params
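As an aside, `parse_qs` is applied here to the whole URL rather than just its query string, which is why the first key ends up fused with the scheme, host, and path. A small standalone sketch (with made-up parameter values) shows the effect:

```python
from urllib.parse import parse_qs

# parse_qs splits on '&' and '=', so when given a full URL the scheme,
# host, and path get fused into the first key -- which is why the payload
# below reads params["https://auth.fool.com/login?state"] instead of
# params["state"]
example_url = "https://auth.fool.com/login?state=abc123&client=xyz789&nonce=n0nce"
params = parse_qs(example_url)

print(params["https://auth.fool.com/login?state"][0])  # abc123
print(params["client"][0])                             # xyz789
print(params["nonce"][0])                              # n0nce
```

If you prefer clean keys, you could pass `urllib.parse.urlparse(response.url).query` to `parse_qs` instead, at the cost of renaming the keys in the payload below.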

payload = {
    "client_id": params["client"][0],
    "redirect_uri": "https://www.fool.com/premium/auth/callback/",
    "tenant": "fool",
    "response_type": "code",
    "scope": "openid email profile",
    "state": params["https://auth.fool.com/login?state"][0],
    "_intstate": "deprecated",
    "nonce": params["nonce"][0],
    "password": "#pas$w0яδ",
    "connection": "TMF-Reg-API",
    "username": "[email protected]",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"
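Note that the hand-built string only works as long as no value contains quotes or backslashes. If that ever becomes an issue, the standard library's `json.dumps` should be a safe drop-in, since it escapes special characters (credentials below are placeholders):

```python
import json

# placeholder credentials for illustration only
payload = {
    "username": "user@example.com",
    "password": 'pa"ss',  # a literal quote would break the hand-built string
}

# json.dumps escapes special characters and always emits valid JSON,
# so it can replace the manual string concatenation above
formatted_payload = json.dumps(payload)
print(formatted_payload)
```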

Finally, we can login:

url = "https://auth.fool.com/usernamepassword/login"
response = session.post(url, data=formatted_payload)

Let me know if you are able to log in or if we need to tweak the script. Just some general comments: I normally use an incognito tab to inspect the browser requests and then copy them over to Postman, where I play around with the parameters and see how they influence the HTTP response. I rarely use Selenium; instead, I invest the time to build proper requests to be used from Python and then use BeautifulSoup.

Edit: After logging in, you can use BeautifulSoup to parse the content of the actual site:

# add BeautifulSoup to our project
from bs4 import BeautifulSoup

# use the session with the login cookies to fetch the data
the_url = "https://www.fool.com/premium/stock-advisor/coverage/tags/buy-recommendation"
data = BeautifulSoup(session.get(the_url).text, 'html.parser')
my_h3 = data.find("h3", "content-item-headline")
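Once `find` returns the tag, `.get_text()` extracts the headline text. A small offline sketch (with made-up HTML that mirrors the class name from the target page) shows the idea:

```python
from bs4 import BeautifulSoup

# made-up HTML mirroring the class used on the target page
html = '<div><h3 class="content-item-headline"> 3 Stocks to Watch </h3></div>'
data = BeautifulSoup(html, "html.parser")

# the second positional argument of find() filters on the class attribute
my_h3 = data.find("h3", "content-item-headline")
print(my_h3.get_text(strip=True))  # 3 Stocks to Watch
```

`get_text(strip=True)` also trims the surrounding whitespace, which these headline elements often carry.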

Upvotes: 1
