DouglasFalcon

Reputation: 41

How to get datetime with BeautifulSoup

I want to scrape all the comments on this website: https://fr.trustpilot.com/review/www.gammvert.fr

I would like to get the comment, the rating and the date for each review. I managed to get the comment and the rating, but not the date.

Here's my script so far:

import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

root_url = 'https://fr.trustpilot.com/review/jardiland.com'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,9) ]

comms = []
notes = []
dates = []

for url in urls: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all('div', class_='review-content')

    for container in commentary:

        comm  = container.find('p', class_ = 'review-content__text').text.strip()
        comms.append(comm)

        note = container.find('div', class_ = 'star-rating star-rating--medium').find('img')['alt']
        notes.append(note)

        date = container.div.div.find('div', class_ = 'review-content-header__dates')
        dates.append(date)

data = pd.DataFrame({
    'comms' : comms,
    'notes' : notes,
    'dates' : dates
    })

data['comms'] = data['comms'].str.replace('\n', '')


#print(data.head())
data.to_csv('filetest.csv', sep=';', index=False)

And here's the HTML for the date (posted as an image titled "html").

I would like to get the "datetime" or "title" attribute rather than the text, because for recent reviews the text is not an actual date but something like "two hours ago", which is useless to me.

Any ideas?

Thanks a lot :)

Upvotes: 1

Views: 1943

Answers (2)

Prayson W. Daniel

Reputation: 15588

The review data is embedded in the page as JSON-LD, so you can read it from there. Make sure you have permission from Trustpilot.

import json
from requests import Session
from bs4 import BeautifulSoup
URL = 'https://fr.trustpilot.com/review/jardiland.com?page=2'

session = Session()
r = session.get(URL)
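# Name the parser explicitly so bs4 does not have to guess (and warn)
soup = BeautifulSoup(r.text, 'html.parser')
# The first <script type="application/ld+json"> tag holds the JSON-LD payload
data = soup.find('script',{'type':'application/ld+json'})
data_json = json.loads(data.getText(strip=True))

# now you can access the data as a dictionary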
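
As a minimal sketch of that last step: the layout of Trustpilot's JSON-LD is not shown here, so the sample below is made up, and it assumes the reviews follow the schema.org Review type, where the date is stored under `datePublished`. The hypothetical helper `find_key` walks any nested dict or list and collects every value stored under a given key, so it does not depend on where the reviews sit in the payload.

```python
import json

# Made-up JSON-LD sample; the structure of the real page may differ.
SAMPLE_JSON_LD = """
[{"@type": "LocalBusiness",
  "name": "Example shop",
  "review": [
    {"@type": "Review", "datePublished": "2020-12-11T10:37:32+00:00",
     "reviewBody": "First review"},
    {"@type": "Review", "datePublished": "2020-12-05T17:28:57+00:00",
     "reviewBody": "Second review"}
  ]}]
"""


def find_key(node, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain more matches
            found.extend(find_key(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


data_json = json.loads(SAMPLE_JSON_LD)
dates = find_key(data_json, "datePublished")
print(dates)
```

With the real page, `data_json` would come from the `json.loads` call above instead of `SAMPLE_JSON_LD`; if the key is named differently there, only the second argument to `find_key` needs to change.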

Upvotes: 0

MendelG

Reputation: 20118

The dates are loaded dynamically, so they are not in the HTML as plain text when you fetch the page with requests. However, they are available in JSON format inside the page: you can find them with the re module and convert them to a dict with the json module.

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


root_url = "https://fr.trustpilot.com/review/jardiland.com"
urls = ["{root}?page={i}".format(root=root_url, i=i) for i in range(1, 9)]

comms = []
notes = []
dates = []

for url in urls:
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    commentary = soup.find_all("div", class_="review-content")

    for container in commentary:

        comm = container.find("p", class_="review-content__text").text.strip()
        comms.append(comm)

        note = container.find("div", class_="star-rating star-rating--medium").find(
            "img"
        )["alt"]
        notes.append(note)

        date_tag = container.div.div.find("div", class_="review-content-header__dates")
        # The tag embeds a JSON object: grab everything between the braces and parse it
        date = json.loads(re.search(r"({.*})", str(date_tag)).group(1))["publishedDate"]

        dates.append(date)


data = pd.DataFrame({"comms": comms, "notes": notes, "dates": dates})

data["comms"] = data["comms"].str.replace("\n", "")


print(data.head())
data.to_csv("filetest.csv", sep=";", index=False)

Output:

                                               comms  ...                      dates
0  Suite à un achat effectué fin novembre, j’ai e...  ...  2020-12-11T10:37:32+00:00
1  Aujourd'hui dans le magasin de Beaucouzé Anger...  ...  2020-12-05T17:28:57+00:00
2  A FUIR! Sur les deux commandes passée : - La p...  ...  2020-12-04T20:31:34+00:00
3  Si vous avez une réclamation évitez le Jardila...  ...  2020-12-01T07:18:55+00:00
4  Quelle honten ! J'ai acheté une nappe ce weeke...  ...  2020-11-25T10:01:31+00:00

[5 rows x 3 columns]
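
To see the regex and json step in isolation, here is a small sketch that runs without any network access. The markup in `SAMPLE_TAG` is hypothetical, modelled only on the fact that the dates tag embeds a JSON object with a `publishedDate` key; the real page may look different. The helper name `extract_published_date` is also an illustration, and converting the string to a `datetime` is an optional extra step.

```python
import json
import re
from datetime import datetime

# Hypothetical markup: a JSON object embedded inside the dates <div>.
SAMPLE_TAG = (
    '<div class="review-content-header__dates">'
    '<script type="application/json">'
    '{"publishedDate":"2020-12-11T10:37:32+00:00","updatedDate":null}'
    "</script></div>"
)


def extract_published_date(tag_html):
    """Return publishedDate as a timezone-aware datetime, or None if no JSON is found."""
    # Greedy match from the first "{" to the last "}" in the tag
    match = re.search(r"(\{.*\})", tag_html, flags=re.DOTALL)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    return datetime.fromisoformat(payload["publishedDate"])


published = extract_published_date(SAMPLE_TAG)
print(published)
```

Returning `None` when nothing matches avoids the `AttributeError` that calling `.group(1)` on a failed search would raise, which can help if some reviews lack the embedded JSON.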

Upvotes: 1
