Sara Jitkresorn

Reputation: 101

Adjusting Web Scraping Code for another site

I'm currently using this code to scrape reviews from TrustPilot. I'd like to adjust it to scrape reviews from https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create. However, unlike most other review sites, the reviews are not split across multiple sub-pages; instead there is a "view more reviews" button at the end of the page that loads 3 additional reviews each time you press it.

Is it possible to adjust the code such that it is able to scrape all the reviews from this particular product within the website with this kind of web structure?

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):  # pages 1 through 43 of the TrustPilot reviews
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = (f'https://www.trustpilot.com/review/birchbox.com?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class':'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class':'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'review-content__text'}).text.strip())
        except AttributeError:
            # some reviews have no body text
            bodies.append('')

        try:
            # the star rating is exposed via the alt text of the review's first image
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except IndexError:
            ratings.append('')
        # the published/updated/reported dates are embedded as a JSON object in the header div
        dateElements = article.find('div', attrs={'class':'review-content-header__dates'}).text.strip()

        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])


    # Build a temporary dataframe for this page, then concatenate it onto the "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df]).reset_index(drop=True)

print ('pass1')


df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('csv written')

Upvotes: 0

Views: 294

Answers (2)

Basically you are dealing with a website that is loaded dynamically via JavaScript: the comments are rendered by JS each time you scroll down (or press the button).

I was able to locate the XHR request that the JS code uses to fetch the comments, and calling it directly retrieves all the comments you asked for.

You don't need to use Selenium, as it would slow down your task.

Here is how you can achieve your target, assuming that each page includes 3 comments; we just do the math to cover all the pages.

import requests
from bs4 import BeautifulSoup
import math


def PageNum():
    # Read the total review count from the "show more reviews" button,
    # whose text ends with the count in parentheses, e.g. "(147)"
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    # 3 comments per page; return one past the last page so range(1, n) covers them all
    if num % 3 == 0:
        return (num // 3) + 1
    else:
        return math.ceil(num / 3) + 1


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            # the endpoint returns HTML with escaped quotes, so the class
            # attribute literally reads \"comment-body\"
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                # strip the escaped-markup prefix and cut at the first literal \n
                print(com.text[5:com.text.find(r"\n", 3)])


Main()

Sample of the output:

Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's 
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time, 
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate 
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.  
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore.  This months 
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies.  I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down. 
I love all the boxy charm boxes everything is amazing all full size products and 
the colors are outstanding
I absolutely love Boxycharm.  I have received amazing high end products.  My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700  total.  I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.  
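
A note on the escaped class selector above: the endpoint appears to return JSON-escaped HTML, which is why the lookup matches the literal string \"comment-body\". A minimal alternative sketch under that assumption is to unescape the payload first and then parse it with plain class names (the helper name fetch_page_comments is illustrative, not part of the original answer):

import requests
from bs4 import BeautifulSoup

def fetch_page_comments(page):
    # Illustrative helper, assuming the endpoint returns JSON-escaped HTML:
    # undo the \" and \n escaping before handing the markup to BeautifulSoup.
    url = f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={page}"
    r = requests.get(url, headers={'X-Requested-With': 'XMLHttpRequest'})
    html = r.text.replace('\\"', '"').replace('\\n', '\n')
    soup = BeautifulSoup(html, 'html.parser')
    return [div.get_text(strip=True) for div in soup.find_all('div', class_='comment-body')]

# Example: print the comments on the first page
for comment in fetch_page_comments(1):
    print(comment)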

Upvotes: 2

Prakhar Jhudele

Reputation: 955

I have also worked out the code for your website. It uses Selenium for the button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:

# -*- coding: utf-8 -*-
"""
Created on Sun Mar  8 18:09:45 2020

@author: prakharJ
"""

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxesweb scrapper ")
driver = webdriver.Chrome(service=Service('Your/path/to/chromedriver.exe'))  # Selenium 4 style
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create' 
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")

    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element(By.CLASS_NAME, 'show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        pass  # button is gone; keep scrolling until the page height stops changing

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements(By.CLASS_NAME, 'name')
comment_list = driver.find_elements(By.CLASS_NAME, 'comment-body')
rating_list = driver.find_elements(By.XPATH, "//meta[@itemprop='ratingValue']")
date_list = driver.find_elements(By.CLASS_NAME, 'comment-date')
for names in names_list:
    names_found.append(names.text)
for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except Exception:
        comments_found.append('NA')
for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except Exception:
        ratings_found.append('NA')
for dateElements in date_list:
    dateElements_found.append(dateElements.text)
# Collect everything that was scraped into a single dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
print('extraction completed')
driver.quit()
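
As a variation on the scroll-and-sleep loop above, the fixed sleeps could be replaced with Selenium's explicit waits. A sketch of a drop-in replacement for the while True loop in the script above (it assumes Selenium 4 and the same driver object, run before driver.quit(); the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Click "show more reviews" until the button stops appearing, waiting for it
# to become clickable instead of sleeping a fixed number of seconds.
wait = WebDriverWait(driver, 10)
while True:
    try:
        button = wait.until(
            EC.element_to_be_clickable((By.CLASS_NAME, 'show-more-reviews')))
        driver.execute_script("arguments[0].click();", button)
    except TimeoutException:
        break  # no button within the timeout; all reviews are loaded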

Upvotes: 1

Related Questions