Ilovenoodles

Reputation: 83

How to iterate through multiple pages in python when NOT knowing the last page

I want to scrape information using BeautifulSoup and iterate through multiple pages. I know how to do this by writing for page in range(1, 3), for example, if I want the info on the first 2 pages. However, the information is dynamic and the number of pages will increase. So how can I iterate when I don't know the last page? Currently I have the following code:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}

listing_details = []

for page in range(1,3):
    response = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={}&pm=1'.format(page), headers=headers)
    listings = BeautifulSoup(response.content, "lxml")
    
    details = listings.findAll('div', attrs={"data-test":"tile"})
    for detail in details:

        # get property links
        links = detail.findAll('a', href=True)
        for link in links:
            link="https://www.realestate.co.nz" + link['href']

        listing_details.append([link])

df3 = pd.DataFrame(listing_details, columns=['Link'])
print(df3)

Upvotes: 0

Views: 2416

Answers (2)

QHarr

Reputation: 84465

You can use a while True loop and break when the next-page element is no longer present (or add in a chosen end page number, whichever comes first):

import requests
from bs4 import BeautifulSoup as bs

page = 0

with requests.Session() as s:

    s.headers = {'User-Agent':'Mozilla/5.0'}

    while True:
        page += 1
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1')
        soup = bs(r.content, 'lxml')
        # the "next" pagination link disappears on the last page
        next_page = soup.select_one('[data-test=next-link]')

        if next_page is None:
            break
        print(page)

You could also calculate the total page count from info provided within a script tag (this version also shows the idea of adding in a target number of pages):

import requests, re
from bs4 import BeautifulSoup as bs
import math

target_pages = 3

with requests.Session() as s:
    
    s.headers = {'User-Agent':'Mozilla/5.0'}
    r = s.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page=1&pm=1')
    # totalResults and resultsPerPage appear (with escaped quotes) inside a script tag
    counts = re.search(r'"totalResults\\":(?P<total>\d+),\\"resultsPerPage\\":(?P<perpage>\d+),', r.text, re.M)
    num_pages = math.ceil(int(counts.group('total')) / int(counts.group('perpage')))
    print(num_pages)
    n = 2
    
    while n <= min(num_pages, target_pages):
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={n}&pm=1')
        print(n)
        n+=1

Upvotes: 1

GabrielBoehme

Reputation: 322

One solution might be to create a while loop and iterate through the pages, adding 1 on each loop. When the page content is empty or the status code is 404, break.
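A minimal sketch of that idea, using the URL and tile selector from the question. The stopping conditions (a 404 status or a page with no listing tiles) are assumptions about how the site behaves past the last page; the loop is factored around a fetch callable so it can be exercised without hitting the network.

```python
import requests
from bs4 import BeautifulSoup

def collect_tiles(fetch, max_pages=100):
    """Walk pages until a 404 or an empty page, collecting listing tiles.

    fetch(page) must return a (status_code, html) tuple; passing a callable
    keeps the loop testable without real HTTP requests. max_pages is a
    safety cap so a misbehaving site cannot loop forever.
    """
    tiles = []
    for page in range(1, max_pages + 1):
        status, html = fetch(page)
        if status == 404:  # hard stop: the page does not exist
            break
        soup = BeautifulSoup(html, 'html.parser')
        found = soup.find_all('div', attrs={'data-test': 'tile'})
        if not found:      # soft stop: page rendered but holds no listings
            break
        tiles.extend(found)
    return tiles

def fetch_live(page):
    # real fetcher for the site in the question
    url = f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1'
    r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
    return r.status_code, r.content
```

Calling collect_tiles(fetch_live) would then gather every tile across all available pages.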

Upvotes: 0
