Scraping next pages

Question

I have Python code that scrapes hotel reviews from Yelp.

The code scrapes the first page of reviews perfectly, but I am struggling to scrape the following pages.

The while loop doesn't work: the data scraped in each iteration is the same (the data from the first page).

import requests
from lxml import html
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
while url:

    r = requests.get(url)
    t = html.fromstring(r.content)
    for i in t.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        rev = i.xpath('.//p[@lang="en"]/text()')[0].strip()
        date = i.xpath('.//span[@class="rating-qualifier"]/text()')[0].strip()
        stars = i.xpath('.//img[@class="offscreen"]/@alt')[0].strip().split(' ')[0]
        print(rev)
        print(date) 
        print(stars) 

    next_page = soup.find('a',{'class':'next'})
    if next_page:
        url = next_page['href']
    else:
        url = None

    sleep(5)

The sleep(5) before requesting the next URL is there to stay within the limits set by the website.

SIM · Accepted Answer

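Here is one way to do it. I've slightly modified your logic for traversing the next pages: in the posted code the next-page link is looked up on soup, which is never built from the response fetched in that iteration, whereas here it is read from the same tree that was just parsed. Give it a shot.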

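import time
from urllib.parse import urljoin

import requests
from lxml.html import fromstring

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'

while True:
    res = requests.get(url)
    root = fromstring(res.text)
    # Skip the first <li>, as in the question's XPath
    for item in root.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        text = item.xpath('.//p[@lang="en"]/text()')
        if text:  # guard against items with no matching paragraph
            print(text[0].strip())

    # Look for the "next" link in the page that was just parsed
    # (cssselect() needs the cssselect package installed)
    next_page = root.cssselect(".pagination-links a.next")
    if not next_page:
        break
    # urljoin leaves absolute hrefs unchanged and resolves relative ones
    url = urljoin(url, next_page[0].get('href'))
    time.sleep(5)  # same delay as in the question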
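
The traversal can also be factored out so that it can be tried without hitting the network. What follows is only a minimal sketch and is not Yelp-specific: crawl_pages, fetch, find_next and FAKE_SITE are hypothetical names introduced for illustration, and the page contents are made up. It follows "next" links until none is left, resolves relative hrefs with urljoin, and stops if a link points back to a page already visited, which is one way the "same data in every loop" symptom can arise.

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, find_next, delay=0.0, max_pages=50):
    """Yield (url, page) pairs, following "next" links until none is left.

    fetch(url) returns the page for a URL; find_next(page) returns the
    href of the next page or None. Both are supplied by the caller.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        yield url, page
        href = find_next(page)
        # Resolve relative links against the current URL; stop when there is no link
        url = urljoin(url, href) if href else None
        if url and delay:
            time.sleep(delay)  # pause between requests


# Made-up three-page site standing in for real HTTP responses (assumption)
FAKE_SITE = {
    "https://example.com/biz/hotel": {"reviews": ["a", "b"], "next": "/biz/hotel?start=20"},
    "https://example.com/biz/hotel?start=20": {"reviews": ["c"], "next": "?start=40"},
    "https://example.com/biz/hotel?start=40": {"reviews": ["d"], "next": None},
}

visited = []
reviews = []
for page_url, page in crawl_pages(
    "https://example.com/biz/hotel",
    fetch=FAKE_SITE.__getitem__,
    find_next=lambda p: p["next"],
):
    visited.append(page_url)
    reviews.extend(page["reviews"])

print(visited)
print(reviews)
```

With requests and lxml, fetch could be something like lambda u: fromstring(requests.get(u).text), and find_next could return the href of the first element matching ".pagination-links a.next", or None when there is no match. Passing delay=5 would reproduce the pause from the question.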
