Reputation: 107
I want to scrape the information of all the cards on this website:
My approach:
import pandas as pd
import numpy as np
import requests
import json
from bs4 import BeautifulSoup

url1 = "https://zerotomastery.io/testimonials/"
res = requests.get(url1)
blog_data = []
if (res.status_code == 200):
    page = BeautifulSoup(res.content , "html.parser")
    print(page.find("div" , {"class" : "divcomponent__Div-sc-hnfdyq-0 base-cardstyles__BaseCard-sc-1eokxla-0 testimonial-cardstyles__TestimonialCard-sc-137v3r9-0 dRXcRh ipQTEw"}))
As you can clearly see, the class is present.
Upvotes: 1
Views: 110
Reputation: 25073
As you can clearly see, the class is present. Can you please tell me why my code is not working?
The class is certainly present, but your code is not working because there is a typo (an additional whitespace) in your class string, around 137v3r9-0 dRXcRh.
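To see why a stray space matters, here is a minimal sketch on a made-up HTML snippet (the class names card-a, card-b and card-c are placeholders, not the real ones from the site). When the class value passed to find contains whitespace, Beautiful Soup compares it against the whole class string, so one extra space is enough to make the lookup return None:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are placeholders for illustration.
html = '<div class="card-a card-b card-c"><h2>Jane</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact class string, single spaces between the classes: this matches.
exact = soup.find("div", {"class": "card-a card-b card-c"})

# Same classes, but with two spaces between card-b and card-c: no match.
extra_space = soup.find("div", {"class": "card-a card-b  card-c"})

# Searching for one class only does not depend on the full string.
single = soup.find("div", class_="card-b")

print(exact is not None)
print(extra_space is None)
print(single is not None)
```

Searching for a single class, as in the last lookup, avoids depending on the order and spacing of the full class string.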
A better strategy is to avoid dynamically generated classes for element selection and rely on more stable things such as an id or the HTML structure.
Select your cards based on their HTML structure with a CSS selector:
soup.select('div:has(>h2+span)')
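As a rough illustration of what this selector does: it keeps every div that has an h2 as a direct child, immediately followed by a sibling span. The markup below is invented for the sketch and only mimics a card layout; the ids card and other are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a card layout; not the site's real HTML.
html = """
<div id="card"><h2>Jane</h2><span>Engineer</span><p>Great course.</p></div>
<div id="other"><h2>Heading only</h2><p>No span after the h2.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only divs whose direct child h2 is immediately followed by a span.
cards = soup.select("div:has(> h2 + span)")
matched_ids = [div["id"] for div in cards]
print(matched_ids)
```

Only the first div is selected, because in the second one the h2 is followed by a p rather than a span.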
Iterate over your ResultSet and pick out your information, again just by structure, appending each card as a dict to your list:
data = []
for card in soup.select('div:has(>h2+span)'):
    data.append({
        'name': card.h2.text,
        'title': card.span.text,
        'text': card.p.text,
        'url': card.a.get('href')
    })
Finally, create a DataFrame from your list:
pd.DataFrame(data)
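Example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Pass the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get('https://zerotomastery.io/testimonials/').content, 'html.parser')

data = []
# Each card is a div whose direct child h2 is immediately followed by a span
for card in soup.select('div:has(>h2+span)'):
    data.append({
        'name': card.h2.text,
        'title': card.span.text,
        'text': card.p.text,
        'url': card.a.get('href')
    })

pd.DataFrame(data)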
privacy redaction
...
Upvotes: 0
Reputation: 28595
You are searching for a very specific class, which appears to be dynamically created. A better option is to match a more general substring found in those classes, such as "TestimonialCard".
import pandas as pd
import numpy as np
import requests
import json
from bs4 import BeautifulSoup
import re

url1 = "https://zerotomastery.io/testimonials/"
res = requests.get(url1)
rows = []
if (res.status_code == 200):
    page = BeautifulSoup(res.content , "html.parser")
    # Match any div whose class contains the substring "TestimonialCard"
    testCards = page.find_all("div" , {"class" : re.compile('.*TestimonialCard.*')})
    for card in testCards:
        name = card.find('h2').text
        job = card.find('span').text
        # The company name is taken from the alt text of the company logo
        company = card.find('img', {'class':re.compile('.*CompanyImage.*')})['alt']
        test = card.find('p').text
        row = {
            'name':name,
            'job':job,
            'company':company,
            'testimonial':test}
        rows.append(row)
df = pd.DataFrame(rows)
print(df)
I simply didn't have time to search through the nested json to pull out the part you were asking for, but it's somewhere in there.
Output:
privacy redaction
[54 rows x 4 columns]
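As an alternative to the regular expression, the same substring idea can be sketched with a CSS attribute selector. This is only an illustration on invented markup: the class names below imitate the generated ones but are placeholders, and the second div stands in for any unrelated element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names only imitate the site's generated ones.
html = """
<div class="base-Card-0 testimonial__TestimonialCard-sc-1 abc"><h2>Jane</h2></div>
<div class="base-Card-0 footer__Footer-sc-2 xyz"><h2>Footer</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# [class*="..."] matches when the class attribute contains the substring.
cards = soup.select('div[class*="TestimonialCard"]')
names = [card.h2.get_text(strip=True) for card in cards]
print(names)
```

Either form ignores the hashed suffixes, so it should keep working as long as the "TestimonialCard" part of the class name stays the same.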
Upvotes: 1
Reputation: 195438
To get the name, title and text from the cards, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://zerotomastery.io/testimonials/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for h2 in soup.select("h2")[1:]:
    name = h2.text
    title = h2.find_next("span").text
    text = h2.find_next("p").text
    print(name)
    print(title)
    print(text)
    print("-" * 80)
Prints:
...
--------------------------------------------------------------------------------
privacy redaction
--------------------------------------------------------------------------------
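The same h2-based approach can be tried offline on a small invented snippet. The markup and names below are assumptions made for the sketch, with the first h2 standing in for a page heading that the [1:] slice skips:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first h2 plays the role of the page heading.
html = """
<h2>Page title</h2>
<div><h2>Jane</h2><span>Engineer</span><p>Great course.</p></div>
<div><h2>Omar</h2><span>Designer</span><p>Learned a lot.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for h2 in soup.select("h2")[1:]:  # skip the first h2
    records.append((
        h2.get_text(strip=True),
        h2.find_next("span").get_text(strip=True),  # next span in document order
        h2.find_next("p").get_text(strip=True),     # next p in document order
    ))

for record in records:
    print(record)
```

Note that find_next searches forward through the whole document, so if a card had no span of its own, this would pick up the span of a later card.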
Upvotes: 0