Reputation: 101
I'm trying to use this code to collect reviews from the Consumer Affairs review site, but I keep getting errors, specifically in the dateElements and jsonData section. Could someone help me fix the code so that it works with the site I want to scrape?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print ('all imported successfuly')
# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []
    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('')
        try:
            ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
        except:
            ratings.append('')
        dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])
    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date':updated, 'Reported Date':reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
    print ('pass1')
df.to_csv('AllureReviews.csv', index=False, encoding='utf-8')
print ('excel done')
This is the error I'm getting:
Traceback (most recent call last):
  File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/Caffairs.py", line 37, in <module>
    jsonData = json.loads(dateElements)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Upvotes: 2
Views: 733
Reputation: 965
Building on the code above, we can also get the ratings and avoid duplicated data as follows:
from bs4 import BeautifulSoup
import requests
import pandas as pd
print ('all imported successfully')
# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    # Reset the per-page lists so rows from earlier pages are not added again
    names = []
    bodies = []
    ratings = []
    dateElements = []
    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            # Look for the review body in a div rather than a p
            bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip())
        except AttributeError:
            # find() returned None, so there is no body for this review
            bodies.append('NA')
        try:
            # Read the rating from the content attribute of the ratingValue meta tag
            ratings.append(article.find('meta', attrs={'itemprop': 'ratingValue'})['content'])
        except (TypeError, KeyError):
            # No meta tag was found, or it has no content attribute
            ratings.append('NA')
        dateElements.append(article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip())
    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': dateElements})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
# Show the combined result for all pages
print (df)
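One caveat for readers on a recent pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the df.append line fails there with an AttributeError. A minimal sketch of the same accumulation with pd.concat follows. The pages list is a hypothetical stand-in for the per-page scraping loop, and its two rows are sample values taken from this thread, not freshly scraped data.

```python
import pandas as pd

# Hypothetical stand-in for the scraping loop: one dict of columns per page.
pages = [
    {"User Name": ["M. M. of Dallas, GA"],
     "Published Date": ["Original review: Feb. 15, 2020"]},
    {"User Name": ["Malinda of Aston, PA"],
     "Published Date": ["Original review: Sept. 21, 2019"]},
]

# Build one DataFrame per page, then combine them in a single call.
frames = [pd.DataFrame(page) for page in pages]

# ignore_index=True renumbers the rows 0..n-1, which replaces the
# reset_index(drop=True) call used with df.append.
df = pd.concat(frames, ignore_index=True, sort=False)
print(df)
```

In the real loop, the idea would be to append each temp_df to a list inside the loop and call pd.concat once after it, which also avoids copying the growing frame on every page.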
Upvotes: 3
Reputation: 10194
dateElements doesn't contain a string that can be parsed by json.loads(), because it is simply a text string, e.g. Original review: Feb. 15, 2020
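The failure can be reproduced without touching the site. This small sketch feeds the example string quoted above to json.loads() and captures the message; the variable names are only for illustration.

```python
import json

# Example value of dateElements, as quoted above.
date_text = "Original review: Feb. 15, 2020"

try:
    json.loads(date_text)
    error_message = None
except json.JSONDecodeError as exc:
    # Plain text is not a JSON document, so decoding fails at the first character.
    error_message = str(exc)

print(error_message)
# Expecting value: line 1 column 1 (char 0)
```

The message matches the last line of the traceback in the question, which suggests the span holds plain text rather than JSON.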
Change these lines to circumvent this:
        try:
            ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
        except:
            ratings.append('')
        dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()
        published.append(dateElements)
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
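If a real date is wanted instead of the raw text, the string can be parsed afterwards. The following is a rough sketch, assuming every value looks like the samples in this thread (a month name or abbreviation, a day, a comma, a year). parse_review_date is a hypothetical helper, not part of any library. It maps the first three letters of the month by hand because strptime's %b does not accept forms such as "Sept.".

```python
import re
from datetime import date

# First three letters of each English month name, mapped to its number.
_MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Month word, optional trailing dot, day, comma, four-digit year.
_DATE_RE = re.compile(r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})")


def parse_review_date(text):
    """Return a date for text like 'Original review: Sept. 21, 2019', else None."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = _MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    return date(int(year), month, int(day))


print(parse_review_date("Original review: Feb. 15, 2020"))
```

Returning None for unrecognized text keeps the scraping loop running when a review uses a format this sketch does not anticipate.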
You can also comment out these two lines, since the lists are no longer used:
# updated = []
# reported = []
Then your code runs without errors, although you still don't get data for Body and Rating.
Printing df gives this:
   User Name             Body  Rating  Published Date
0  M. M. of Dallas, GA                 Original review: Feb. 15, 2020
1  Malinda of Aston, PA                Original review: Sept. 21, 2019
2  Ping of Tarzana, CA                 Original review: July 18, 2019
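A possible reason for the empty Body and Rating columns is that the review text is not in a p tag and the stars div carries no visible text. The sketch below tries the selectors from the other answer in this thread (a div with class rvw-bd, and a meta tag with itemprop ratingValue) on a small HTML snippet. The snippet is hypothetical, modeled only on the class names quoted here, so the real page may differ. text_or_default is an illustrative helper, and html.parser is used instead of lxml only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, built only from the selectors quoted in this thread.
html = """
<div class="rvw js-rvw">
  <strong class="rvw-aut__inf-nm">Ping of Tarzana, CA</strong>
  <div class="stars-rtg stars-rtg--sm"><meta itemprop="ratingValue" content="1"/></div>
  <span class="ca-txt-cpt">Original review: July 18, 2019</span>
  <div class="rvw-bd"><p>Example review text.</p></div>
</div>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or a default when find() gave None."""
    return tag.get_text(strip=True) if tag is not None else default


article = BeautifulSoup(html, "html.parser").find("div", {"class": "rvw js-rvw"})

# In this snippet the stars div has no text, so the original selector yields ''.
stars_text = text_or_default(article.find("div", attrs={"class": "stars-rtg stars-rtg--sm"}))

# No p tag carries the rvw-bd class here, so the original body lookup finds nothing.
p_body = article.find("p", attrs={"class": "rvw-bd"})

# The alternative selectors read the body from the div and the rating from the meta tag.
body = text_or_default(article.find("div", attrs={"class": "rvw-bd"}), "NA")
rating_tag = article.find("meta", attrs={"itemprop": "ratingValue"})
rating = rating_tag["content"] if rating_tag is not None and rating_tag.has_attr("content") else "NA"

print(repr(stars_text), p_body, body, rating)
```

Checking for None explicitly, rather than wrapping each lookup in a bare except, makes it easier to see which selector failed when the site's markup changes.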
Upvotes: 1