Reputation: 41
I want to scrape all the comments on this website: https://fr.trustpilot.com/review/www.gammvert.fr
I would like to get the comment, the rating, and the date. I managed to obtain the comment and the rating, but not the date.
Here's my script so far:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
root_url = 'https://fr.trustpilot.com/review/jardiland.com'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,9) ]
comms = []
notes = []
dates = []
for url in urls:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all('div', class_='review-content')
    for container in commentary:
        comm = container.find('p', class_='review-content__text').text.strip()
        comms.append(comm)
        note = container.find('div', class_='star-rating star-rating--medium').find('img')['alt']
        notes.append(note)
        date = container.div.div.find('div', class_='review-content-header__dates')
        dates.append(date)
data = pd.DataFrame({
    'comms': comms,
    'notes': notes,
    'dates': dates
})
data['comms'] = data['comms'].str.replace('\n', '')
#print(data.head())
data.to_csv('filetest.csv', sep=';', index=False)
And here's the HTML for the date:
I would like to get the "datetime" or "title" attribute, not the text, because for recent reviews the text isn't the actual date but something like "two hours ago", which is useless to me.
Any ideas?
Thanks a lot :)
Upvotes: 1
Views: 1943
Reputation: 15588
The data is available as JSON-LD in the page source. Make sure you have permission from Trustpilot.
import json
from requests import Session
from bs4 import BeautifulSoup
URL = 'https://fr.trustpilot.com/review/jardiland.com?page=2'
session = Session()
r = session.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
data = soup.find('script',{'type':'application/ld+json'})
data_json = json.loads(data.getText(strip=True))
# now you can access data_json as a dictionary
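As a sketch of what that dictionary access might look like, assuming the JSON-LD follows schema.org Review markup with a datePublished field (the hypothetical payload below is illustrative; the real structure on the page may differ):
```python
import json

# Hypothetical JSON-LD payload in schema.org style; the real page's
# structure may be nested differently.
data_json = json.loads('''
{"@graph": [
  {"@type": "Review",
   "datePublished": "2020-12-11T10:37:32+00:00",
   "reviewBody": "Bon service"}
]}
''')

# Keep only the Review entries and read their publication dates.
reviews = [item for item in data_json.get("@graph", [])
           if item.get("@type") == "Review"]
for review in reviews:
    print(review["datePublished"], review["reviewBody"])
```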
Upvotes: 0
Reputation: 20118
The dates are rendered dynamically, so requests alone doesn't see them as plain text. However, the dates are embedded as JSON in the page source: you can extract them with the re module and convert them to a dict with the json module.
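A minimal sketch of that re + json step in isolation, using a hypothetical rendered tag string (the real attribute names and surrounding markup may differ):
```python
import re
import json

# Hypothetical stringified tag; on the real page str(date_tag) would be
# the serialized <div> found by BeautifulSoup.
date_tag = ('<div class="review-content-header__dates">'
            '{"publishedDate":"2020-12-11T10:37:32+00:00","updatedDate":null}'
            '</div>')

# Greedily grab everything between the first "{" and the last "}",
# then parse that substring as JSON.
match = re.search(r"({.*})", date_tag)
info = json.loads(match.group(1))
print(info["publishedDate"])  # 2020-12-11T10:37:32+00:00
```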
import re
import json
import requests
import numpy as np
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
root_url = "https://fr.trustpilot.com/review/jardiland.com"
urls = ["{root}?page={i}".format(root=root_url, i=i) for i in range(1, 9)]
comms = []
notes = []
dates = []
for url in urls:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all("div", class_="review-content")
    for container in commentary:
        comm = container.find("p", class_="review-content__text").text.strip()
        comms.append(comm)
        note = container.find("div", class_="star-rating star-rating--medium").find("img")["alt"]
        notes.append(note)
        # Extract the JSON embedded in the stringified date tag.
        date_tag = container.div.div.find("div", class_="review-content-header__dates")
        date = json.loads(re.search(r"({.*})", str(date_tag)).group(1))["publishedDate"]
        dates.append(date)
data = pd.DataFrame({"comms": comms, "notes": notes, "dates": dates})
data["comms"] = data["comms"].str.replace("\n", "")
print(data.head())
data.to_csv("filetest.csv", sep=";", index=False)
Output:
comms ... dates
0 Suite à un achat effectué fin novembre, j’ai e... ... 2020-12-11T10:37:32+00:00
1 Aujourd'hui dans le magasin de Beaucouzé Anger... ... 2020-12-05T17:28:57+00:00
2 A FUIR! Sur les deux commandes passée : - La p... ... 2020-12-04T20:31:34+00:00
3 Si vous avez une réclamation évitez le Jardila... ... 2020-12-01T07:18:55+00:00
4 Quelle honten ! J'ai acheté une nappe ce weeke... ... 2020-11-25T10:01:31+00:00
[5 rows x 3 columns]
Upvotes: 1