ossama assaghir

Reputation: 308

Can't get span text inside of a ul using BeautifulSoup

I'm trying to scrape reviews from a website. To be specific, I'm trying to get the text inside a span, but I get no results. I'm new to Python and scraping, so I don't know exactly what I'm doing.

Code:

from bs4 import BeautifulSoup
import requests

url = "https://forum.bouyguestelecom.fr"
req = requests.get(url).text

soup = BeautifulSoup(req, 'html.parser')

activites = soup.find_all('div', class_="RF_CONTENT")

for activite in activites:
    NickName = activite.find("span", class_="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_NICKNAME").text
    Name = activite.find("span", class_="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_ACH_NAME").text
    print(NickName)
    print(Name)

Here is the HTML I'm trying to get data from:

<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME">
<div class="RF_CONTENT">
<p><a href="https://forum.bouyguestelecom.fr/questions/2814793-trouve-inadmissibile-ligne-active-partir-20-novembre-realiser-commande-3-11" target="_blank">

<span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_NICKNAME">Celia q.</span>
    <span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_ACTION"> a posé une question: Je trouve inadmissibile que ma ligne ne soit active qu'a partir du 20 novembre en ayant réaliser la commande le 3/11 ?</span><span class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_RESUME_DATETIME">Il y a 2 minute(s)</span></a>
</p>

<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_SCORE">20  <span class="RF_TEXT_INDICE"> pts</span></div></div>
</div></li><li class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM">

<div class="RFW_ACTIVITY_SECTION_ACTIVITY_FEED_ELEM_PERFORMER">
<img src="https://api.rocketfid.com/media/default/rainbow/user/" alt="">
</div>

You can see it for yourself here. My code returns no results.


Upvotes: 0

Views: 210

Answers (1)

Dmitriy Zub

Reputation: 1724

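You can get this data by making a direct request to the website's API/server; there is no need for selenium for such a task. The activity feed appears to be loaded from that API after the page loads, which would explain why it is missing from the HTML that requests downloads and why BeautifulSoup finds nothing.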



About the user-agent: if no user-agent is passed in the request headers, the requests library defaults to python-requests, so a website can tell that the request comes from a bot/script and may block it. Check what your user-agent is.

In this case it works without a user-agent, but in most cases one is needed so the request looks like a "real" user visit. Sometimes a user-agent alone isn't enough, and additional request headers need to be added.
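To see what the user-agent point means in practice, here is a minimal sketch that inspects the headers without sending anything over the network. The URL `https://example.com` and the shortened user-agent string are placeholders chosen for illustration, not values from the site in question.

```python
import requests

# User-Agent that requests sends when none is given,
# e.g. "python-requests/<version>" (the version depends on your install).
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build (but do not send) a request to inspect the headers that would go out.
# The URL and the header value below are placeholders.
custom_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com", headers=custom_headers)
)

# The custom value overrides the session default.
print(prepared.headers["User-Agent"])
```

If a site still rejects the request with a browser-like user-agent, the same dictionary is where any additional headers would go.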

import requests, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "content-type": "application/json",
    "x-instance": "rainbow"  # only this header matters; without it the request fails with a 501 error
}

response = requests.get("https://api.rocketfid.com/activity/cache/all/0/5/", headers=headers).text

# print data to see the actual JSON string
# and grab any additional data you want the same way as shown below:
# simply add another variable and access it by KEY
data = json.loads(response)

for result in data:
    nick_name = result["performer"]["nickname"].title()

    # if achievement is empty, the data is located under the action key;
    # try/except is one way of handling it
    try:
        name = result["achievement"]["name"]
    except (KeyError, TypeError):
        name = result["action"]["name"]

    try:
        score = result["achievement"]["score"]
    except (KeyError, TypeError):
        score = result["action"]["score"]

    # additional data here.. 

    print(f"{nick_name}\n{name}\n{score}\n")


# output
'''
Adrien S.
En savoir plus
10

Adrien S.
 a posé une question: Remboursement clé 4g ?
20

Marie-Claude P.
Première fois
20

Marie-Claude P.
 a répondu à la question: Comment supprimer l’autre numéro de contact dans mes préférences de contact ?
10

Marie-Claude P.
a fait la première réponse à une question
20
'''
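The achievement/action fallback in the code above can also be written without try/except. The sketch below is one possible way to do it. The helper `pick_field` and the `sample` records are hypothetical: the records only imitate the shape implied by the answer (each of `achievement` and `action` is assumed to be either a dict or empty/None), they are not real API output.

```python
def pick_field(result, field):
    """Return result["achievement"][field] if present, else result["action"][field].

    Hypothetical helper; assumes each section is a dict, None, or missing.
    """
    for key in ("achievement", "action"):
        section = result.get(key) or {}
        if field in section:
            return section[field]
    return None


# Made-up records shaped like the ones the answer iterates over.
sample = [
    {
        "performer": {"nickname": "adrien s."},
        "achievement": {"name": "En savoir plus", "score": 10},
        "action": None,
    },
    {
        "performer": {"nickname": "adrien s."},
        "achievement": None,
        "action": {"name": " a posé une question: Remboursement clé 4g ?", "score": 20},
    },
]

rows = [
    (r["performer"]["nickname"].title(), pick_field(r, "name"), pick_field(r, "score"))
    for r in sample
]

for nick_name, name, score in rows:
    print(f"{nick_name}\n{name}\n{score}\n")
```

Unlike the try/except version, this returns None instead of raising when neither section has the field, which may or may not be what you want.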

Upvotes: 1
