Reputation: 107

Why does BS4 not find an element by its class even though it is present in the HTML?

I want to scrape the information of all the cards on this website:

My approach:

import pandas as pd 
import numpy as np
import requests
import json
from bs4 import BeautifulSoup
url1 = "https://zerotomastery.io/testimonials/"
res = requests.get(url1)
blog_data = []
if (res.status_code == 200):
    page = BeautifulSoup(res.content , "html.parser")
    print(page.find("div" , {"class" : "divcomponent__Div-sc-hnfdyq-0 base-cardstyles__BaseCard-sc-1eokxla-0 testimonial-cardstyles__TestimonialCard-sc-137v3r9-0  dRXcRh ipQTEw"}))

As you can clearly see, the class is present.

Upvotes: 1

Answers (3)

HedgeHog

Reputation: 25073

As you can clearly see, the class is present. Can you please tell me why my code is not working?

The class is present for sure, but your code is not working because there is a typo in your class string: an additional whitespace between 137v3r9-0 and dRXcRh.
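To illustrate the point, here is a minimal sketch on made-up HTML (the class names are shortened stand-ins, not the site's real ones). When the class value you search for contains a space, BeautifulSoup compares it against the exact attribute string, so a doubled space makes the lookup fail:

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are shortened stand-ins for the real ones.
html = '<div class="base-0 card-0 dRXcRh ipQTEw"><h2>Jane</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# Two spaces before dRXcRh: not the exact attribute string, so find() returns None.
missing = soup.find("div", {"class": "base-0 card-0  dRXcRh ipQTEw"})

# The exact attribute string (single spaces) matches.
exact = soup.find("div", {"class": "base-0 card-0 dRXcRh ipQTEw"})

# A single class name matches regardless of the other classes on the element.
single = soup.find("div", class_="card-0")

print(missing, exact is not None, single is not None)
```

Searching for one class name is usually the more robust choice, since it does not depend on the order or spacing of the other classes.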

It is a good strategy to avoid dynamically generated classes for element selection and to rely on more stable things such as an id or the HTML structure.

Select your cards based on their HTML structure with a CSS selector:

soup.select('div:has(>h2+span)')
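As a rough illustration of what this selector matches, the sketch below runs it against made-up markup (the names and structure are assumptions, not copied from the site). div:has(>h2+span) selects every div that has a direct child h2 immediately followed by a span:

```python
from bs4 import BeautifulSoup

# Made-up markup: two card-like divs and one div without a span after its h2.
html = """
<div id="wrapper">
  <div class="x1"><h2>Jane</h2><span>Developer</span><p>Great course.</p></div>
  <div class="x2"><h2>Heading only</h2><p>No span here.</p></div>
  <div class="x3"><h2>Sam</h2><span>Engineer</span><p>Loved it.</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only divs whose direct child h2 is immediately followed by a span are selected.
cards = soup.select("div:has(>h2+span)")
names = [card.h2.text for card in cards]
print(names)
```

The wrapper div and the div without a span are skipped, because neither has an h2 child directly followed by a span.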

Iterate over your ResultSet and pick out the information you need, again just by structure, appending it as a dict to your list:

for card in soup.select('div:has(>h2+span)'):
    data.append({
        'name':card.h2.text,
        'title': card.span.text,
        'text': card.p.text,
        'url': card.a.get('href')
    })

Finally, create a DataFrame from your list:

pd.DataFrame(data)

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(requests.get('https://zerotomastery.io/testimonials/').content, 'html.parser')
data = []

for card in soup.select('div:has(>h2+span)'):
    data.append({
        'name':card.h2.text,
        'title': card.span.text,
        'text': card.p.text,
        'url': card.a.get('href')
    })

pd.DataFrame(data)

Output

privacy redaction

...

Upvotes: 0

chitown88

Reputation: 28595

You are searching for a very specific class that appears to be dynamically generated. A better option is to match on a more general substring found in those classes, such as "TestimonialCard".

import pandas as pd 
import numpy as np
import requests
import json
from bs4 import BeautifulSoup
import re

url1 = "https://zerotomastery.io/testimonials/"
res = requests.get(url1)
rows = []
if (res.status_code == 200):
    page = BeautifulSoup(res.content , "html.parser")
    testCards = page.find_all("div" , {"class" : re.compile('.*TestimonialCard.*')})

    for card in testCards:
        name = card.find('h2').text
        job = card.find('span').text
        company = card.find('img', {'class':re.compile('.*CompanyImage.*')})['alt']
        test = card.find('p').text

        row = {
            'name':name,
            'job':job,
            'company':company,
            'testimonial':test}
        
        rows.append(row)

    df = pd.DataFrame(rows)

I simply didn't have time to search through the nested json to pull out the part you were asking for, but it's somewhere in there.

Output:

print(df)

privacy redaction

[54 rows x 4 columns]
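A CSS attribute selector can express the same substring match without importing re. This is only a sketch on made-up HTML, not the site's real markup; [class*="TestimonialCard"] matches any element whose class attribute contains that substring:

```python
from bs4 import BeautifulSoup

# Made-up markup: one testimonial-like card and one unrelated div.
html = """
<div class="base-sc-1 testimonial-cardstyles__TestimonialCard-sc-2 abc">
  <h2>Jane</h2><span>Developer</span>
</div>
<div class="base-sc-1 footer__Footer-sc-9 xyz"><h2>Links</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# *= means "attribute value contains this substring".
cards = soup.select('div[class*="TestimonialCard"]')
names = [card.find("h2").text for card in cards]
print(names)
```

Both approaches depend on the "TestimonialCard" fragment staying in the generated class names, so they are more stable than the full class string but not guaranteed to survive a site redesign.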

Upvotes: 1

Andrej Kesely

Reputation: 195438

To get the name, title and text from the cards, you can use the following example:

import requests
from bs4 import BeautifulSoup


url = "https://zerotomastery.io/testimonials/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for h2 in soup.select("h2")[1:]:
    name = h2.text
    title = h2.find_next("span").text
    text = h2.find_next("p").text
    print(name)
    print(title)
    print(text)
    print("-" * 80)

Prints:


...

--------------------------------------------------------------------------------
privacy redaction
--------------------------------------------------------------------------------
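For trying the same idea offline, here is a sketch on made-up HTML (the markup is an assumption, with a page-level heading first, which is why the first h2 is skipped). It also guards against find_next returning None, which the live example does not need but which may help if some cards lack a span or p:

```python
from bs4 import BeautifulSoup

# Made-up markup: a page heading followed by two card-like blocks.
html = """
<h2>Testimonials</h2>
<div><h2>Jane</h2><span>Developer</span><p>Great course.</p></div>
<div><h2>Sam</h2><span>Engineer</span><p>Loved it.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for h2 in soup.select("h2")[1:]:  # skip the page-level heading
    span = h2.find_next("span")   # next <span> anywhere after this h2
    p = h2.find_next("p")         # next <p> anywhere after this h2
    rows.append({
        "name": h2.get_text(strip=True),
        "title": span.get_text(strip=True) if span else None,
        "text": p.get_text(strip=True) if p else None,
    })

print(rows)
```

Note that find_next searches the rest of the document, not just the current card, so a card with a missing span would pick up the span of a later card.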

Upvotes: 0
