Reputation: 527
I have Python code that scrapes hotel reviews from Yelp.
The code scrapes the first page of reviews perfectly, but I am struggling to scrape the following pages.
The while loop doesn't work: the data scraped in each iteration is the same (the data from the first page).
import requests
from lxml import html
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
while url:
    r = requests.get(url)
    t = html.fromstring(r.content)
    for i in t.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        rev = i.xpath('.//p[@lang="en"]/text()')[0].strip()
        date = i.xpath('.//span[@class="rating-qualifier"]/text()')[0].strip()
        stars = i.xpath('.//img[@class="offscreen"]/@alt')[0].strip().split(' ')[0]
        print(rev)
        print(date)
        print(stars)
    next_page = soup.find('a',{'class':'next'})
    if next_page:
        url = next_page['href']
    else:
        url = None
    sleep(5)
Here, the sleep(5) before requesting the new URL is meant to avoid the rate limit set by the website.
Upvotes: 0
Views: 325
Reputation: 22440
The following is one of the ways you can get your job done. I've slightly modified your existing logic of traversing next pages. Give it a shot.
import requests
from lxml.html import fromstring  # .cssselect() also needs the cssselect package

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
while True:
    res = requests.get(url)
    root = fromstring(res.text)
    # Print the text of every review on the current page
    for item in root.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        rev = item.xpath('.//p[@lang="en"]/text()')[0].strip()
        print(rev)
    # Follow the "next" pagination link; stop when there is none
    next_page = root.cssselect(".pagination-links a.next")
    if not next_page:
        break
    url = next_page[0].get('href')
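One thing worth guarding against with this approach: if the href of the next link is relative, or the link ever points back to a page already visited, the loop can fail or spin forever. The following is only a rough sketch of that idea, not a drop-in replacement. The names crawl_pages, fetch and find_next_href are made up for illustration; fetch stands in for something like requests.get plus parsing, and find_next_href for the cssselect lookup above.

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, find_next_href, delay=0.0, max_pages=50):
    """Yield (url, page) pairs, following "next" links until none is found.

    fetch(url) returns a parsed page; find_next_href(page) returns the href
    of the next link or None. Both are hypothetical callables supplied by
    the caller.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        yield url, page
        href = find_next_href(page)
        # urljoin leaves absolute hrefs alone and resolves relative ones
        url = urljoin(url, href) if href else None
        if url and delay:
            time.sleep(delay)  # be polite between requests
```

With the code from this answer, fetch could be lambda u: fromstring(requests.get(u).text), and find_next_href could return the href of the first match of root.cssselect(".pagination-links a.next"), or None when there is no match.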
Upvotes: 3
Reputation: 206
You just need to be smart about looking at the URL. Most websites follow a scheme with their page progression. In this case, it seems like it changes to the following format for the next pages:
https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?start=20&sort_by=rating_desc
The start=20 parameter is the part to look at. Rewrite the URL at the end of the while loop: once the current page has been scraped, add 20 to that number and put it into the string, like so:
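pagenum = 0
while url:
    # ... scrape the current page here ...
    pagenum += 20
    # str() is needed because an int cannot be concatenated to a string
    url = "https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?start=" + str(pagenum) + "&sort_by=rating_desc"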
Then terminate the loop with a try/except block, which catches the case where the URL won't load because there are no more pages.
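To make the offset idea concrete, here is a minimal sketch under a few assumptions. The page size of 20 comes from the URL above; page_url, scrape_all and fetch_reviews are hypothetical names, and fetch_reviews stands in for whatever downloads and parses one page. The sketch stops at the first page with no reviews instead of relying on an exception, since a site may well return an empty page rather than an error once the offset runs past the end; that stopping rule is an assumption, not something confirmed for Yelp.

```python
from urllib.parse import urlencode

BASE = "https://www.yelp.com/biz/fairmont-san-francisco-san-francisco"


def page_url(start, sort_by="rating_desc", base=BASE):
    """Build the URL for the page whose first review has offset `start`."""
    params = {"sort_by": sort_by}
    if start:
        # Put start first so the URL matches the format shown above
        params = {"start": start, "sort_by": sort_by}
    return base + "?" + urlencode(params)


def scrape_all(fetch_reviews, page_size=20, max_pages=100):
    """Collect reviews page by page until a page comes back empty.

    fetch_reviews(url) is a hypothetical callable that returns the list
    of reviews found at that URL.
    """
    reviews = []
    for page in range(max_pages):
        batch = fetch_reviews(page_url(page * page_size))
        if not batch:  # assumed end-of-results signal
            break
        reviews.extend(batch)
    return reviews
```

If the site does raise an error past the last page, the call to fetch_reviews can be wrapped in try/except as suggested above, treating the exception the same way as an empty batch.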
Upvotes: 2