Khaled Koubaa

Reputation: 527

Scraping next pages

I have Python code that scrapes hotel reviews from Yelp.

The code scrapes the first page of reviews perfectly, but I am struggling to scrape the following pages.

The while loop doesn't work: the data scraped on each iteration is the same (the data from the first page).

import requests
from lxml import html
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'
while url:

    r = requests.get(url)
    t = html.fromstring(r.content)
    for i in t.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        rev = i.xpath('.//p[@lang="en"]/text()')[0].strip()
        date = i.xpath('.//span[@class="rating-qualifier"]/text()')[0].strip()
        stars = i.xpath('.//img[@class="offscreen"]/@alt')[0].strip().split(' ')[0]
        print(rev)
        print(date) 
        print(stars) 

    next_page = soup.find('a',{'class':'next'})
    if next_page:
        url = next_page['href']
    else:
        url = None

    sleep(5)

The sleep(5) before requesting each new URL is there to avoid the rate limits set by the website.

Upvotes: 0

Views: 325

Answers (2)

SIM

Reputation: 22440

The following is one of the ways you can get your job done. I've slightly modified your existing logic of traversing next pages. Give it a shot.

import requests
from lxml.html import fromstring  # root.cssselect() below also needs the cssselect package

url = 'https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?sort_by=rating_desc'

while True:
    res = requests.get(url)
    root = fromstring(res.text)
    # Print the text of every review on the current page
    for item in root.xpath("//div[@class='review-list']/ul/li[position()>1]"):
        rev = item.xpath('.//p[@lang="en"]/text()')[0].strip()
        print(rev)

    # Follow the "next" pagination link; stop when the page has none
    next_page = root.cssselect(".pagination-links a.next")
    if not next_page:
        break
    url = next_page[0].get('href')
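
One caveat with following the href directly: the code above does not show whether the link is absolute or relative, and requests.get() needs an absolute URL. A minimal sketch of a guard, assuming the href may be relative; next_url is a hypothetical helper name, not part of the original code:

```python
from urllib.parse import urljoin


def next_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link.

    `next_url` is a hypothetical helper for illustration only.
    """
    if not href:
        return None
    # urljoin leaves an absolute href unchanged and resolves a
    # relative one against the page it was found on.
    return urljoin(current_url, href)
```

In the loop, this would replace the last line with url = next_url(url, next_page[0].get('href')).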

Upvotes: 3

Andrew Bowling

Reputation: 206

You just need to look closely at the URL. Most websites follow a predictable scheme for page progression. In this case, the URL appears to change to the following format for subsequent pages:

https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?start=20&sort_by=rating_desc

The part to look at is start=20. Rewrite the URL at the end of the while loop: once the current page has been processed, add 20 to that number and put it back into the string. Like so:

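pagenum = 0
while url:
    # ... request and parse the current page here ...
    pagenum += 20  # advance the offset to the next page
    # str() is required: concatenating an int to a string raises TypeError
    url = "https://www.yelp.com/biz/fairmont-san-francisco-san-francisco?start=" + str(pagenum) + "&sort_by=rating_desc"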

Then terminate the loop with a try/except block that catches the case where the URL won't load because there are no more pages.
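
Putting the start=20 scheme and the try/except termination together, a rough sketch could look like the following. It rests on some assumptions: the page size of 20 is inferred from the URL above, and page_url, iter_pages, fetch_reviews and max_pages are hypothetical names. fetch_reviews stands for whatever function downloads one URL and returns its list of reviews (for example, the requests and lxml code from the question).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/biz/fairmont-san-francisco-san-francisco"
PAGE_SIZE = 20  # assumption: inferred from the start=20 step in the URL


def page_url(start):
    """Build the URL of the page whose first review has offset `start`."""
    query = urlencode({"start": start, "sort_by": "rating_desc"})
    return f"{BASE_URL}?{query}"


def iter_pages(fetch_reviews, max_pages=50):
    """Yield each page's reviews until a page fails to load or is empty.

    `fetch_reviews` is a hypothetical callable: URL in, list of reviews out.
    `max_pages` is a safety cap so the loop always ends.
    """
    for page in range(max_pages):
        try:
            reviews = fetch_reviews(page_url(page * PAGE_SIZE))
        except OSError:
            # requests' exceptions derive from OSError (IOError),
            # so a failed download ends the loop here.
            break
        if not reviews:
            # The page loaded but held no reviews: assume we are past the end.
            break
        yield reviews
```

The empty-page check is there because a site may return a valid page with no reviews past the last offset instead of an error, in which case try/except alone would never stop the loop.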

Upvotes: 2
