onr

Reputation: 296

Scraping pagination with Python

I'm trying to scrape some data for airlines from the following website: http://www.airlinequality.com/airline-reviews/airasia-x.

I managed to get the data I need, but I am struggling with the pagination on the web page. I'm trying to get all the titles of the reviews (not only the ones on the first page).

The links of the pages are in the format: http://www.airlinequality.com/airline-reviews/airasia-x/page/3/ where 3 is the number of the page.

I tried looping through these URLs, and I also tried the following piece of code, but scraping through the pagination is not working.

# follow pagination links
for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
    yield response.follow(href, self.parse)
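
For reference, the direct URL loop I tried looked roughly like this (inside the spider's parse method; the upper bound on the page number is hard-coded purely for illustration):

# sketch of the URL loop; the limit of 10 pages is an assumption
for page in range(1, 10):
    url = 'http://www.airlinequality.com/airline-reviews/airasia-x/page/{}/'.format(page)
    yield scrapy.Request(url, callback=self.parse_article)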

How can I solve this?

import scrapy
import re  # for text parsing
import logging
from scrapy.crawler import CrawlerProcess


class AirlineSpider(scrapy.Spider):
    name = 'airlineSpider'
    # page to scrape
    start_urls = ['http://www.airlinequality.com/review-pages/a-z-airline-reviews/']  

    def parse(self, response):
        # take each element in the list of the airlines

        for airline in response.css("div.content ul.items li"):
            # go inside the URL for each airline
            airline_url = airline.css('a::attr(href)').extract_first()

            # Call parse_airline
            next_page = airline_url
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse_article)

            # follow pagination links
            for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
                yield response.follow(href, self.parse)

    # to go to the pages inside the links (for each airline) - the page where the reviews are
    def parse_article(self, response):
        yield {
            'appears_ulr': response.url,
            # use sub to replace \n\t \r from the result
            'title':  re.sub('\s+', ' ', (response.css('div.info [itemprop="name"]::text').extract_first()).strip(' \t \r \n').replace('\n', ' ') ).strip(),
            'reviewTitle': response.css('div.body .text_header::text').extract(),
            #'total': response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > div.pagination-total::text').extract_first().split(" ")[4],
        }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'air_test.json'
})

# minimizing the information presented on the scrapy log
logging.getLogger('scrapy').setLevel(logging.WARNING)
process.crawl(AirlineSpider)
process.start()

To iterate through the airlines, I solved it using the following piece of code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("http://www.airlinequality.com/review-pages/a-z-airline-reviews/", headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
soupAirlines = BeautifulSoup(html_page, "lxml")

URL_LIST = []
for link in soupAirlines.findAll('a',  attrs={'href': re.compile("^/airline-reviews/")}):
    URL_LIST.append("http://www.airlinequality.com"+link.get('href'))
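
One way to plug these URLs into the Scrapy spider above would be to assign them as its start pages before crawling, e.g.:

# use the collected airline URLs as the spider's start pages,
# instead of starting from the A-Z index page
AirlineSpider.start_urls = URL_LIST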

Upvotes: 4

Views: 6538

Answers (1)

datawrestler

Reputation: 1567

Assuming scrapy is not a hard requirement, the following BeautifulSoup code will get you all the reviews, with the metadata parsed out, and a final output of a pandas DataFrame. The specific attributes pulled from each review are:

  • Review Title
  • Rating (when available)
  • Rating scale (i.e. out of 10)
  • Review full text
  • Date stamp of review
  • Whether or not the review is verified

There is a specific function that handles the pagination. It is recursive: if there is a next page, it calls itself again to parse the new URL; otherwise the recursion ends.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# define global parameters
URL = 'http://www.airlinequality.com/airline-reviews/airasia-x'
BASE_URL = 'http://www.airlinequality.com'
MASTER_LIST = []

def parse_review(review):
    """
    Parse important review meta data such as ratings, time of review, title, 
    etc.

    Parameters
    -------
    review - beautifulsoup tag 

    Return 
    -------
    outdf - pd.DataFrame
        DataFrame representation of parsed review
    """

    # get review header
    header = review.find('h2').text

    # get the numerical rating
    base_review = review.find('div', {'itemprop': 'reviewRating'})
    if base_review is None:
        rating = None
        rating_out_of = None
    else:
        rating = base_review.find('span', {'itemprop': 'ratingValue'}).text
        rating_out_of = base_review.find('span', {'itemprop': 'bestRating'}).text

    # get time of review
    time_of_review = review.find('h3').find('time')['datetime']

    # get whether review is verified
    if review.find('em'):
        verified = review.find('em').text
    else:
        verified = None

    # get actual text of review
    review_text = review.find('div', {'class': 'text_content'}).text

    outdf = pd.DataFrame({'header': header,
                         'rating': rating,
                         'rating_out_of': rating_out_of,
                         'time_of_review': time_of_review,
                         'verified': verified,
                         'review_text': review_text}, index=[0])

    return outdf

def return_next_page(soup):
    """
    return next_url if pagination continues else return None

    Parameters
    -------
    soup - BeautifulSoup object - required

    Return 
    -------
    next_url - str or None if no next page
    """
    next_url = None
    # anchor tag for the currently active page in the pagination widget
    cur_page = soup.find('a', {'class': 'active'}, href=re.compile('airline-reviews/airasia'))
    # the list item right after the active page holds the next-page link;
    # on the last page that item carries a class, so we stop there
    search_next = cur_page.findNext('li').get('class')
    if not search_next:
        next_page_href = cur_page.findNext('li').find('a')['href']
        next_url = BASE_URL + next_page_href
    return next_url

def create_soup_reviews(url):
    """
    iterate over each review, extract out content, and handle next page logic 
    through recursion

    Parameters
    -------
    url - str - required
        input url
    """
    # use global MASTER_LIST to extend list of all reviews 
    global MASTER_LIST
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    reviews = soup.findAll('article', {'itemprop': 'review'})
    review_list = [parse_review(review) for review in reviews]
    MASTER_LIST.extend(review_list)
    next_url = return_next_page(soup)
    if next_url is not None:
        create_soup_reviews(next_url)


create_soup_reviews(URL)


finaldf = pd.concat(MASTER_LIST)
finaldf.shape # (339, 6)

finaldf.head(2)
# header    rating  rating_out_of   review_text time_of_review  verified
#"if approved I will get my money back" 1   10  ✅ Trip Verified | Kuala Lumpur to Melbourne. ...    2018-08-07  Trip Verified
#   "a few minutes error"   3   10  ✅ Trip Verified | I've flied with AirAsia man...    2018-08-06  Trip Verified

If I were to do the whole site, I would use the above and iterate over each airline listed on the A-Z review page (http://www.airlinequality.com/review-pages/a-z-airline-reviews/). I would modify the code to include a column named 'airline' so you know which airline each review corresponds to.
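
A rough sketch of that loop follows (the scrape_all_airlines helper is hypothetical; it reuses the link pattern from the question's snippet, and the hard-coded 'airasia' regex in return_next_page would need to be generalized first):

def scrape_all_airlines():
    """Scrape every airline on the A-Z page and tag each review with its airline."""
    frames = []
    index_url = BASE_URL + '/review-pages/a-z-airline-reviews/'
    index = BeautifulSoup(requests.get(index_url).content, 'html.parser')
    # airline review pages all share the /airline-reviews/ prefix
    for link in index.findAll('a', attrs={'href': re.compile('^/airline-reviews/')}):
        airline = link['href'].rstrip('/').split('/')[-1]
        MASTER_LIST.clear()  # reset the global list between airlines
        create_soup_reviews(BASE_URL + link['href'])
        if MASTER_LIST:
            df = pd.concat(MASTER_LIST)
            df['airline'] = airline  # tag each review with its airline
            frames.append(df)
    return pd.concat(frames, ignore_index=True)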

Upvotes: 2
