Python Request entire HTML page, instead of initially loaded content

Question

I am trying to collect review data that is publicly available on the PlayStore. Since the provided API only returns reviews for one's own apps, I am trying to scrape the reviews from the web instead.

I am using the requests package to get the HTML page of a given app on the PlayStore, and BeautifulSoup to parse it and save it to a file, so that I can then extract the relevant content (the rating and comment of each user).

My issue is that requests.get(URL) does not retrieve the entire content of the page. Following "Read All Reviews" for an app on the PlayStore leads to a page with all reviews for that app. However, only a limited set of reviews is present when the page first loads; the rest load only after scrolling down to the bottom. Calling requests.get(URL) therefore returns only that initial set of reviews, not all of them.

To see this, navigate to https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true and notice that older reviews load only when you scroll to the bottom of the page.

Is there a way to access the entire page, trigger the loading of more reviews, or simulate the scrolling?

Below is my code:

import requests
from bs4 import BeautifulSoup

# get reviews for Thirty Days of Fitness app
URL = "https://play.google.com/store/apps/details?id=com.bendingspoons.thirtydayfitness&hl=en&showAllReviews=true"

# make request (returns only the initially loaded HTML)
response = requests.get(URL)
# extract HTML text
raw_text = response.text

# parse HTML and prettify
soup = BeautifulSoup(raw_text, 'html.parser')
text = soup.prettify()

# write to file
save_path = './thirtydayfitness_html.txt'
with open(save_path, 'w', encoding=response.encoding) as f:
    f.write(text)

Muhammad Ashfaq · Accepted Answer

Consider using a web driver such as Selenium to scroll down until no more content loads, like so:

import time

# `driver` is assumed to be a Selenium WebDriver that has already loaded the page,
# for example: driver = webdriver.Chrome(); driver.get(URL)
SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the newly requested content to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height;
    # if it did not grow, nothing more was loaded
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

Reference: How can I scroll a web page using selenium webdriver in python?
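
To tie the scrolling loop back to the code in the question, here is a minimal sketch of how the two might be combined. It assumes Selenium and a matching Chrome driver are installed. The helper names scroll_to_bottom and fetch_full_html, and the max_rounds safety cap, are hypothetical and not part of the original answer. It has not been checked against the current PlayStore page, which may need extra steps (for example clicking a button to load more reviews) that this sketch does not handle.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=100):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that caused new content to load.
    max_rounds is an assumed safety cap so an endless feed cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render more content
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height unchanged: nothing more was loaded
            break
        last_height = new_height
        rounds += 1
    return rounds


def fetch_full_html(url, pause=0.5):
    """Load url in Chrome, scroll to the bottom, and return the rendered HTML."""
    # Imported here so scroll_to_bottom can be used without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver, pause)
        # page_source reflects the DOM after the scripted scrolling
        return driver.page_source
    finally:
        driver.quit()
```

With this in place, raw_text = fetch_full_html(URL) could replace the requests.get(URL) call in the question, and the BeautifulSoup parsing would stay the same. A fixed pause is the simplest option; a slow connection may need a larger value.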
