Reputation: 41
I want to scrape all the comments on this website: https://fr.trustpilot.com/review/www.gammvert.fr
I would like to get the comment, the rating, and the date. I managed to obtain the comment and the rating, but not the date.
Here's my script so far:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
root_url = 'https://fr.trustpilot.com/review/jardiland.com'
urls = [ '{root}?page={i}'.format(root=root_url, i=i) for i in range(1,9) ]
comms = []
notes = []
dates = []
for url in urls:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all('div', class_='review-content')
    for container in commentary:
        comm = container.find('p', class_='review-content__text').text.strip()
        comms.append(comm)
        note = container.find('div', class_='star-rating star-rating--medium').find('img')['alt']
        notes.append(note)
        date = container.div.div.find('div', class_='review-content-header__dates')
        dates.append(date)
data = pd.DataFrame({
    'comms': comms,
    'notes': notes,
    'dates': dates
})
data['comms'] = data['comms'].str.replace('\n', '')
#print(data.head())
data.to_csv('filetest.csv', sep=';', index=False)
And here's the HTML for the date:
I would like to get the "datetime" or "title" attribute, not the text, because for recent reviews the text isn't the actual date but something like "two hours ago", which is useless to me.
Any ideas?
Thanks a lot :)
Upvotes: 1
Views: 1943
Reputation: 15588
The data is available as JSON-LD in the page source. Make sure you have permission from Trustpilot.
import json
from requests import Session
from bs4 import BeautifulSoup
URL = 'https://fr.trustpilot.com/review/jardiland.com?page=2'
session = Session()
r = session.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
data = soup.find('script',{'type':'application/ld+json'})
data_json = json.loads(data.getText(strip=True))
# now you can access data_json as a dictionary
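As a sketch of what that dictionary access might look like, assuming the JSON-LD follows schema.org Review markup with a datePublished field (the hypothetical payload below is illustrative; the real structure on the page may differ):
```python
import json

# Hypothetical JSON-LD payload in schema.org style; the real page's
# structure may be nested differently.
data_json = json.loads('''
{"@graph": [
  {"@type": "Review",
   "datePublished": "2020-12-11T10:37:32+00:00",
   "reviewBody": "Bon service"}
]}
''')

# Keep only the Review entries and read their publication dates.
reviews = [item for item in data_json.get("@graph", [])
           if item.get("@type") == "Review"]
for review in reviews:
    print(review["datePublished"], review["reviewBody"])
```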
Upvotes: 0
Reputation: 20118
The dates are rendered dynamically, so requests alone doesn't see them as plain text. However, the dates are embedded as JSON in the page source: you can extract them with the re module and convert them to a dict with the json module.
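A minimal sketch of that re + json step in isolation, using a hypothetical rendered tag string (the real attribute names and surrounding markup may differ):
```python
import re
import json

# Hypothetical stringified tag; on the real page str(date_tag) would be
# the serialized <div> found by BeautifulSoup.
date_tag = ('<div class="review-content-header__dates">'
            '{"publishedDate":"2020-12-11T10:37:32+00:00","updatedDate":null}'
            '</div>')

# Greedily grab everything between the first "{" and the last "}",
# then parse that substring as JSON.
match = re.search(r"({.*})", date_tag)
info = json.loads(match.group(1))
print(info["publishedDate"])  # 2020-12-11T10:37:32+00:00
```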
import re
import json
import requests
import numpy as np
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
root_url = "https://fr.trustpilot.com/review/jardiland.com"
urls = ["{root}?page={i}".format(root=root_url, i=i) for i in range(1, 9)]
comms = []
notes = []
dates = []
for url in urls:
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")
    commentary = soup.find_all("div", class_="review-content")
    for container in commentary:
        comm = container.find("p", class_="review-content__text").text.strip()
        comms.append(comm)
        note = container.find("div", class_="star-rating star-rating--medium").find("img")["alt"]
        notes.append(note)
        # Extract the JSON embedded in the stringified date tag.
        date_tag = container.div.div.find("div", class_="review-content-header__dates")
        date = json.loads(re.search(r"({.*})", str(date_tag)).group(1))["publishedDate"]
        dates.append(date)
data = pd.DataFrame({"comms": comms, "notes": notes, "dates": dates})
data["comms"] = data["comms"].str.replace("\n", "")
print(data.head())
data.to_csv("filetest.csv", sep=";", index=False)
Output:
comms ... dates
0 Suite à un achat effectué fin novembre, j’ai e... ... 2020-12-11T10:37:32+00:00
1 Aujourd'hui dans le magasin de Beaucouzé Anger... ... 2020-12-05T17:28:57+00:00
2 A FUIR! Sur les deux commandes passée : - La p... ... 2020-12-04T20:31:34+00:00
3 Si vous avez une réclamation évitez le Jardila... ... 2020-12-01T07:18:55+00:00
4 Quelle honten ! J'ai acheté une nappe ce weeke... ... 2020-11-25T10:01:31+00:00
[5 rows x 3 columns]
Upvotes: 1