Reputation: 308
I'm trying to scrape reviews from a website. Specifically, I'm trying to get the text inside a span, but I get no results. I'm new to Python and scraping, so I don't know exactly what I'm doing.
Code:
from bs4 import BeautifulSoup
import requests
url = "https://forum.bouyguestelecom.fr"
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
activites = soup.find_all('div', class_="RF_CONTENT")
for activite in activites:
    NickName = activite.find("span", class_="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_NICKNAME").text
    Name = activite.find("span", class_="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_ACH_NAME").text
    print(NickName)
    print(Name)
Here is the HTML I'm trying to get data from:
<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME">
<div class="RF_CONTENT">
<p><a href="https://forum.bouyguestelecom.fr/questions/2814793-trouve-inadmissibile-ligne-active-partir-20-novembre-realiser-commande-3-11" target="_blank">
<span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_NICKNAME">Celia q.</span>
<span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_ACTION"> a posé une question: Je trouve inadmissibile que ma ligne ne soit active qu'a partir du 20 novembre en ayant réaliser la commande le 3/11 ?</span><span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_DATETIME">Il y a 2 minute(s)</span></a>
</p>
<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_SCORE">20 <span class="RF_TEXT_INDICE"> pts</span></div></div>
</div></li><li class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM">
<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_PERFORMER">
<img src="https://api.rocketfid.com/media/default/rainbow/user/" alt="">
</div>
You can see it for yourself here. My code returns no results.
Upvotes: 0
Views: 210
Reputation: 1724
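You can parse this data by making a direct request to the website's API, so Selenium is not needed for this task. The activity feed is presumably filled in by JavaScript after the page loads, which would explain why the HTML returned by requests contains no matching spans.
The full code and example output are at the end of this answer.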
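One way to check whether the feed is really absent from the downloaded HTML is to count the elements that carry the target class. The sketch below uses only the standard library's html.parser. `ClassCounter` and `count_elements_with_class` are hypothetical helper names, and the two HTML strings are made-up samples standing in for the page before and after a script has filled in the feed.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


TARGET = "RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_NICKNAME"

# Made-up sample: what a server might send before any script has run
static_html = '<div id="feed"></div><script src="widget.js"></script>'
# Made-up sample: the same area after a script has filled it in
rendered_html = '<div id="feed"><span class="' + TARGET + '">Celia q.</span></div>'

print(count_elements_with_class(static_html, TARGET))    # 0
print(count_elements_with_class(rendered_html, TARGET))  # 1
```

If the same count on `requests.get(url).text` comes back as 0, that suggests the spans are added by a script, and the data has to be fetched from whatever endpoint that script calls.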
About the user-agent: if no user-agent is passed in the request headers, the requests library defaults to python-requests, so a website can tell that the visit comes from a bot or script and may block the request. Check what your user-agent is.
In this case it works without a user-agent, but in most cases one is needed to make the request look like a "real" user visit. Even then, adding a user-agent alone is sometimes not enough, and additional request headers have to be added.
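To see the default for yourself, the headers can be inspected without sending anything. This is a minimal sketch, assuming the requests library is installed. `headers_that_would_be_sent` is a hypothetical helper name, and example.com is only a placeholder URL that is never contacted.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
)


def headers_that_would_be_sent(extra_headers=None):
    """Return the headers requests would send, without contacting any server."""
    session = requests.Session()
    # example.com is a placeholder; the request is prepared but never sent
    request = requests.Request("GET", "https://example.com/", headers=extra_headers)
    prepared = session.prepare_request(request)
    return dict(prepared.headers)


default_headers = headers_that_would_be_sent()
custom_headers = headers_that_would_be_sent({"User-Agent": BROWSER_UA})

print(default_headers["User-Agent"])  # starts with "python-requests/"
print(custom_headers["User-Agent"])   # the browser-like string above
```

Headers passed with the request override the session defaults, which is why supplying "User-Agent" replaces the python-requests value.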
import json

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "content-type": "application/json",
    "x-instance": "rainbow"  # only this header matters; without it the request fails with a 501 error
}

response = requests.get("https://api.rocketfid.com/activity/cache/all/0/5/", headers=headers).text

# print data to see the actual JSON
# and grab additional data the same way as shown below:
# simply add another variable and access it by key
data = json.loads(response)

for result in data:
    nick_name = result["performer"]["nickname"].title()

    # if the achievement data is empty, the values are located under the action key
    # using try/except is one way of handling it
    try:
        name = result["achievement"]["name"]
    except (KeyError, TypeError):
        name = result["action"]["name"]

    try:
        score = result["achievement"]["score"]
    except (KeyError, TypeError):
        score = result["action"]["score"]

    # additional data here..

    print(f"{nick_name}\n{name}\n{score}\n")
# output
'''
Adrien S.
En savoir plus
10

Adrien S.
a posé une question: Remboursement clé 4g ?
20

Marie-Claude P.
Première fois
20

Marie-Claude P.
a répondu à la question: Comment supprimer l’autre numéro de contact dans mes préférences de contact ?
10

Marie-Claude P.
a fait la première réponse à une question
20
'''
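The achievement-then-action fallback can also be written once as a small helper instead of repeating try/except for every field. This is only a sketch: `pick` is a hypothetical name, and the sample records are made up to mirror the keys used in the code above, assuming an empty section arrives as None. The real API response may be shaped differently.

```python
def pick(result, field):
    """Return result["achievement"][field], falling back to result["action"][field]."""
    for section in ("achievement", "action"):
        section_data = result.get(section)
        # Skip sections that are missing, None, or not a dict
        if isinstance(section_data, dict) and field in section_data:
            return section_data[field]
    return None


# Made-up records that mirror the keys used above (an assumption, not real API output)
sample = [
    {
        "performer": {"nickname": "adrien s."},
        "achievement": {"name": "En savoir plus", "score": 10},
        "action": None,
    },
    {
        "performer": {"nickname": "adrien s."},
        "achievement": None,
        "action": {"name": "a posé une question: Remboursement clé 4g ?", "score": 20},
    },
]

rows = [
    (r["performer"]["nickname"].title(), pick(r, "name"), pick(r, "score"))
    for r in sample
]

for nick_name, name, score in rows:
    print(f"{nick_name}\n{name}\n{score}\n")
```

Unlike the try/except version, this returns None when neither section has the field, so a missing value does not raise.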
Upvotes: 1