Reputation: 89
I am completely new to web scraping and would like to scrape reviews and property replies from: https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews
However, the HTML I obtain seems to be for the hostel page rather than the overlay page with the reviews, and I was wondering how to obtain and scrape from the reviews panel instead.
I can scrape user reviews using the snippet below,
import requests
from bs4 import BeautifulSoup

url = 'https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews'
response = requests.get(url)
SoupPage = BeautifulSoup(response.text, 'html.parser')
reviews = SoupPage.find_all(class_="review-info")
for rev in reviews:
    text = rev.find(class_="notes")
but it appears to be from a different source to the reviews panel since I do not see any classes or text corresponding to the property replies. Any help or suggestions would be appreciated.
Upvotes: 0
Views: 708
Reputation: 4783
If you want to scrape the whole review panel (all of the pages), I would recommend requesting the reviews endpoint directly with the following code:
import requests
import pandas as pd

numb_of_pages = 10  # set this to the total number of review pages
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
frames = []
for nmb in range(1, numb_of_pages + 1):  # +1 so the last page is included
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    # DataFrame.append was removed in pandas 2.0, so collect one frame per page
    frames.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {nmb} out of {numb_of_pages}")
df = pd.concat(frames, ignore_index=True)
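If the total number of pages is not known in advance, one option is to keep requesting pages until one comes back with no reviews. The sketch below assumes the endpoint returns an empty "reviews" list past the last page, which I have not verified. `collect_reviews`, `fetch_page` and `max_pages` are hypothetical names, and the data is a made-up offline stand-in for the real responses:

```python
def collect_reviews(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... until a page has no reviews.

    fetch_page is any callable that takes a page number and returns that
    page's list of review dicts. max_pages is a safety cap on the loop.
    """
    reviews = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # assumption: an empty page means we are past the last one
            break
        reviews.extend(batch)
    return reviews


# Offline stand-in for the real endpoint: two pages of made-up reviews.
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
all_reviews = collect_reviews(lambda page: fake_pages.get(page, []))
print(len(all_reviews))
```

Against the live site, `fetch_page` could wrap the `requests.get(url, headers=headers).json()["reviews"]` call from the loop above.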
Alternatively, if you only want a few pages' worth of comments, you can use the code below:
import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
frames = []
for nmb in range(1, numb_of_pages + 1):  # +1 so the last page is included
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    # DataFrame.append was removed in pandas 2.0, so collect one frame per page
    frames.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {nmb} out of {numb_of_pages}")
df = pd.concat(frames, ignore_index=True)
print(df)
(PS: the reviews are received in the form of a JSON string so you don't need BeautifulSoup)
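To find which field holds the property replies, it can help to list the keys that appear in the review records. This is a minimal sketch on a made-up payload: the field names `notes` and `ownerComment` are placeholders for illustration, not the real API's names, and `field_names` is a hypothetical helper:

```python
import json

# Made-up payload for illustration only; the real field names may differ.
sample = json.loads("""
{"reviews": [
  {"id": 1, "notes": "Great stay", "ownerComment": "Thank you!"},
  {"id": 2, "notes": "Noisy at night", "ownerComment": null}
]}
""")


def field_names(records):
    """Return the sorted set of keys that appear in any record."""
    return sorted({key for record in records for key in record})


names = field_names(sample["reviews"])
print(names)
```

With the DataFrame built above, `df.columns` gives the same information for the real data.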
I hope this helps
Upvotes: 1