BenJHC

Reputation: 89

Scraping data presented on an overlay / new window

I am completely new to web scraping and would like to scrape reviews and property replies from: https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews

However, the HTML I obtain seems to be for the hostel page rather than the overlay page with the reviews, and I was wondering how to obtain and scrape from the reviews panel instead.

I can scrape user reviews using the snippet below,

import requests
from bs4 import BeautifulSoup

url = 'https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews'

response = requests.get(url)
SoupPage = BeautifulSoup(response.text, 'html.parser')
reviews = SoupPage.find_all(class_="review-info")

for rev in reviews:
    text = rev.find(class_="notes")

but it appears to be from a different source to the reviews panel since I do not see any classes or text corresponding to the property replies. Any help or suggestions would be appreciated.

Upvotes: 0

Views: 708

Answers (1)

Nazim Kerimbekov

Reputation: 4783

If you want to scrape the whole review panel (all of the pages), I would recommend requesting the reviews endpoint directly, page by page, with the following code:

import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
df = pd.DataFrame()

# Pages are numbered from 1, so the range runs up to and including numb_of_pages
for nmb in range(1, numb_of_pages + 1):
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df = pd.concat([df, pd.DataFrame(data_raw["reviews"])], ignore_index=True)

    print(f"page: {nmb} out of {numb_of_pages}")
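If you do not know the number of pages in advance, one option is to keep requesting pages until one comes back with no reviews. This is only a rough sketch: it assumes the endpoint returns an empty "reviews" list once you go past the last page, which I have not verified, and the names fetch_page, collect_all_reviews and max_pages are my own, not part of any library.

```python
from typing import Callable, Dict, List

URL = "https://www.hostelworld.com/properties/1850/reviews"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(page: int) -> List[Dict]:
    """Return the list stored under "reviews" for one page of the endpoint."""
    # Imported here so the paging logic below can be reused without the network.
    import requests

    params = {"sort": "newest", "page": page, "monthCount": 36}
    response = requests.get(URL, headers=HEADERS, params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("reviews", [])


def collect_all_reviews(
    fetch: Callable[[int], List[Dict]], max_pages: int = 1000
) -> List[Dict]:
    """Request pages 1, 2, 3, ... and stop at the first empty page.

    Assumption: an empty page marks the end of the reviews. max_pages is a
    safety cap in case that assumption does not hold.
    """
    collected: List[Dict] = []
    for page in range(1, max_pages + 1):
        reviews = fetch(page)
        if not reviews:
            break
        collected.extend(reviews)
    return collected
```

Calling collect_all_reviews(fetch_page) gives a plain list of dicts, which you can pass to pd.DataFrame(...) once at the end instead of growing the DataFrame inside the loop.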

Alternatively, if you only want a few pages' worth of comments, you can use the code below:

import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
df = pd.DataFrame()

# Pages are numbered from 1, so the range runs up to and including numb_of_pages
for nmb in range(1, numb_of_pages + 1):
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df = pd.concat([df, pd.DataFrame(data_raw["reviews"])], ignore_index=True)

    print(f"page: {nmb} out of {numb_of_pages}")

print(df)

(PS: the reviews are received in the form of a JSON string so you don't need BeautifulSoup)
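
Since the question was about getting the property replies alongside the reviews, here is a small sketch of pulling both out of the parsed JSON. The payload below is made up, and the keys "notes" and "ownerComment" are assumptions on my part; print one real review dict and check which keys it actually has before relying on them.

```python
import json

# Hypothetical payload shape, for illustration only. The real response may
# use different key names for the review text and the property reply.
SAMPLE = """
{"reviews": [
  {"id": 1, "notes": "Great stay", "ownerComment": "Thanks for visiting!"},
  {"id": 2, "notes": "Noisy dorm", "ownerComment": null}
]}
"""


def extract_reviews_and_replies(payload, text_key="notes", reply_key="ownerComment"):
    """Return one dict per review holding the review text and the reply (or None)."""
    rows = []
    for review in payload.get("reviews", []):
        rows.append({
            "review": review.get(text_key),
            "reply": review.get(reply_key),  # None when the property did not reply
        })
    return rows


rows = extract_reviews_and_replies(json.loads(SAMPLE))
for row in rows:
    print(row)
```

With the real endpoint you would pass requests.get(url, headers=headers).json() in place of json.loads(SAMPLE).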

I hope this helps

Upvotes: 1
