Xin

Reputation: 674

How to scrape all pages without knowing how many pages there are

I have the following function to gather all the prices, but I'm having trouble finding the total number of pages. How can I scrape through all the pages without knowing in advance how many there are?

import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    price = soup.find_all('h3', {'class' : 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    return price

What I tried, but it doesn't seem to work:

for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break

Upvotes: 0

Views: 937

Answers (3)

tklodd

Reputation: 1079

This is a great opportunity for recursion, provided you don't anticipate more than about 1000 pages, since Python's default recursion limit is 1000:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):

    # avoid a mutable default argument so repeated calls start with a fresh list
    if prices is None:
        prices = []

    if depth >= max_depth:
        return prices

    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
    
    r = requests.get(url)
    if r.status_code != 200:
        # stop as soon as a page stops returning 200
        return prices

    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class' : 'price'})
    price = pd.DataFrame([(p.text) for p in price]).rename(columns = {0:'Price'})

    prices.append(price)
    
    return get_prices(page=page+1, prices=prices, depth=depth+1)

prices = get_prices()

So get_prices is first called with its default parameters. It then keeps calling itself, appending each page's prices to the prices list, until either a page stops returning status code 200 or it reaches the maximum depth you specified.
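Note that prices ends up as a list with one DataFrame per page. If you want a single table at the end (this is just a usage sketch, not part of the answer above), you could concatenate the frames:

import pandas as pd

# prices is a list of per-page DataFrames; stack them into one table
all_prices = pd.concat(prices, ignore_index=True)
print(all_prices.head())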

Alternatively, if you don't like recursion or you need to query more than 1000 pages, you could use a simpler, but less interesting, while loop:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices():

    prices=[]
    page = 1

    while True:

        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        
        r = requests.get(url)
        if r.status_code != 200:
            # stop as soon as a page stops returning 200
            break

        soup = BeautifulSoup(r.text, 'html.parser')
        price = soup.find_all('h3', {'class' : 'price'})
        price = pd.DataFrame([(p.text) for p in price]).rename(columns = {0:'Price'})

        prices.append(price)

        page += 1
    
    return prices

prices = get_prices()
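One caveat, and this is an assumption about the site rather than something verified here: some listing sites keep answering with status 200 for out-of-range page numbers and just render an empty result list, in which case the loop above would never stop. You could guard against that by also breaking when a page yields no prices, right after building the per-page DataFrame:

        # assumption: a page with no listings means we ran past the last page
        if price.empty:
            break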

Upvotes: 2

7u5h4r

Reputation: 457

Try this:

import itertools
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

def get_data(prices, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.find_all('h3', {'class': 'price'})
    # store this page's prices in the shared dict under its page number
    prices[page] = pd.DataFrame([t.text for t in tags]).rename(columns={0: 'Price'})

prices = dict()
for page in itertools.count(start=1):
    try:
        get_data(prices, str(page))
    except Exception:
        # urlopen raises once a page no longer exists
        break

Upvotes: 0

fernand0

Reputation: 310

Maybe you should change "get_data('1')" to "get_data(str(pages))"?
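For example, a sketch of the question's loop with that change applied (the table also has to exist before the first iteration, and pd.concat is used here since DataFrame.append was removed in pandas 2.0):

import itertools
import pandas as pd

# start from an empty frame and stack each page's prices onto it
table = pd.DataFrame()
for pages in itertools.count(start=1):
    try:
        table = pd.concat([table, get_data(str(pages))], ignore_index=True)
    except Exception:
        break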

Upvotes: -1
