Kent Umeki

Reputation: 1

Data Mining IMDB Reviews - Only extracting the first 25 reviews

I am currently trying to extract all of the reviews for the Spiderman Homecoming movie, but I am only able to get the first 25. In the browser I can click "Load More" until every review is shown (the page initially displays only 25), but for some reason my script still only scrapes the first 25 after everything has been loaded. Does anyone know what I am doing wrong?

Below is the code I am running:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")

#Go to the IMDB reviews page
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")

#Loop load more button
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break


#Scrape the IMDB reviews
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")

#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')

#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')

#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')

#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,
    "Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')

print(title)
print(review)
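
Note that the most likely reason only 25 reviews come back is that the script re-downloads the page with requests.get after Selenium has already expanded it; that fresh response only ever contains the first 25 reviews. A minimal sketch of parsing the page Selenium already loaded instead, assuming driver is the session from the loop above and using the same selectors as the original code:

#Parse the DOM that Selenium expanded, instead of re-requesting the URL
soup = BeautifulSoup(driver.page_source, "html.parser")
main = soup.find(id="main")
titles = [t.get_text(strip=True) for t in main.select(".title")]
reviews = [r.get_text() for r in main.select(".content .text")]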

Upvotes: 0

Views: 921

Answers (1)

MendelG

Reputation: 20048

Well, actually, there's no need to use Selenium. The data is available by sending a GET request to the website's API in the following format:

https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY

where you have to provide a key for the paginationKey parameter in the URL (...&paginationKey=MY-KEY).

The key is found in the data-key attribute of the div with class load-more-data:

<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
            </div>

So, to scrape all the reviews into a DataFrame, try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    # Update the `key` variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

df = pd.DataFrame(data)
print(df)
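
If you also want the CSV file from your original script, you could then write the DataFrame out with something like:

df.to_csv("Spiderman_Reviews.csv", index=False)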

Output (truncated):

                                                title                                             review
0                              Terrific entertainment  Spiderman: Far from Home is not intended to be...
1         THe illusion of the identity of Spider man.  Great story in continuation of spider man home...
2                       What Happened to the Bad Guys  I believe that Quinten Beck/Mysterio got what ...
3                                         Spectacular  One of the best if not the best Spider-Man mov...

...
...

Upvotes: 4
