Sara Jitkresorn

Reputation: 101

Adjusting Web Scraping Code for another site

I'm currently using this code to scrape reviews from TrustPilot. I'd like to adjust it to scrape reviews from https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create. However, unlike most other review sites, the reviews are not split across multiple sub-pages; instead there is a "view more reviews" button at the end of the page that loads 3 additional reviews each time you press it.

Is it possible to adjust the code such that it is able to scrape all the reviews from this particular product within the website with this kind of web structure?

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):  # pages 1 through 43 of the TrustPilot reviews
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = (f'https://www.trustpilot.com/review/birchbox.com?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class':'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class':'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'review-content__text'}).text.strip())
        except AttributeError:
            # some reviews have no body text
            bodies.append('')

        try:
            # the star rating is exposed via the alt text of the review's first image
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except IndexError:
            ratings.append('')
        # the published/updated/reported dates are embedded as a JSON object in the header div
        dateElements = article.find('div', attrs={'class':'review-content-header__dates'}).text.strip()

        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])


    # Build a temporary dataframe for this page, then concatenate it onto the "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df]).reset_index(drop=True)

print ('pass1')


df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('csv written')

Upvotes: 0

Views: 294

Answers (2)

Basically you are dealing with a website that is loaded dynamically via JavaScript: the comments are rendered by JS each time you scroll down (or press the button).

I was able to locate the XHR request that the JS code uses to fetch the comments, and calling it directly retrieves all the comments you asked for.

You don't need to use Selenium, as it would slow down your task.

Here is how you can achieve your target, assuming that each page includes 3 comments; we just do the math to cover all the pages.

import requests
from bs4 import BeautifulSoup
import math


def PageNum():
    # Read the total review count from the "show more reviews" button,
    # whose text ends with the count in parentheses, e.g. "(147)"
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    # 3 comments per page; return one past the last page so range(1, n) covers them all
    if num % 3 == 0:
        return (num // 3) + 1
    else:
        return math.ceil(num / 3) + 1


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            # the endpoint returns HTML with escaped quotes, so the class
            # attribute literally reads \"comment-body\"
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                # strip the escaped-markup prefix and cut at the first literal \n
                print(com.text[5:com.text.find(r"\n", 3)])


Main()

Sample of the output:

Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's 
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time, 
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate 
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.  
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore.  This months 
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies.  I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down. 
I love all the boxy charm boxes everything is amazing all full size products and 
the colors are outstanding
I absolutely love Boxycharm.  I have received amazing high end products.  My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700  total.  I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.  
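
A note on the escaped class selector above: the endpoint appears to return JSON-escaped HTML, which is why the lookup matches the literal string \"comment-body\". A minimal alternative sketch under that assumption is to unescape the payload first and then parse it with plain class names (the helper name fetch_page_comments is illustrative, not part of the original answer):

import requests
from bs4 import BeautifulSoup

def fetch_page_comments(page):
    # Illustrative helper, assuming the endpoint returns JSON-escaped HTML:
    # undo the \" and \n escaping before handing the markup to BeautifulSoup.
    url = f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={page}"
    r = requests.get(url, headers={'X-Requested-With': 'XMLHttpRequest'})
    html = r.text.replace('\\"', '"').replace('\\n', '\n')
    soup = BeautifulSoup(html, 'html.parser')
    return [div.get_text(strip=True) for div in soup.find_all('div', class_='comment-body')]

# Example: print the comments on the first page
for comment in fetch_page_comments(1):
    print(comment)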

Upvotes: 2

Prakhar Jhudele

Reputation: 955

I have also worked out the code for your website. It uses Selenium for the button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:

# -*- coding: utf-8 -*-
"""
Created on Sun Mar  8 18:09:45 2020

@author: prakharJ
"""

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxesweb scrapper ")
driver = webdriver.Chrome(service=Service('Your/path/to/chromedriver.exe'))  # Selenium 4 style
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create' 
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")

    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element(By.CLASS_NAME, 'show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        pass  # button is gone; keep scrolling until the page height stops changing

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements(By.CLASS_NAME, 'name')
comment_list = driver.find_elements(By.CLASS_NAME, 'comment-body')
rating_list = driver.find_elements(By.XPATH, "//meta[@itemprop='ratingValue']")
date_list = driver.find_elements(By.CLASS_NAME, 'comment-date')
for names in names_list:
    names_found.append(names.text)
for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except Exception:
        comments_found.append('NA')
for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except Exception:
        ratings_found.append('NA')
for dateElements in date_list:
    dateElements_found.append(dateElements.text)
# Collect everything that was scraped into a single dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
print('extraction completed')
driver.quit()
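
As a variation on the scroll-and-sleep loop above, the fixed sleeps could be replaced with Selenium's explicit waits. A sketch of a drop-in replacement for the while True loop in the script above (it assumes Selenium 4 and the same driver object, run before driver.quit(); the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Click "show more reviews" until the button stops appearing, waiting for it
# to become clickable instead of sleeping a fixed number of seconds.
wait = WebDriverWait(driver, 10)
while True:
    try:
        button = wait.until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'show-more-reviews')))
        driver.execute_script("arguments[0].click();", button)
    except TimeoutException:
        break  # no button within the timeout; all reviews are loaded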

Upvotes: 1

Related Questions