Tapiwa Sibanda

Reputation: 25

Scraping multiple pages of a website

I'm trying to scrape property listings from a website. My code gets through the first iteration of the page loop smoothly, printing the data and writing it to the file, but it prints nothing for the following pages. Note: I am using a Python 3 notebook. Here is my Python code.

import urllib3
from bs4 import BeautifulSoup as soup
from time import sleep
from random import randint
import pandas as pd
http = urllib3.PoolManager()

filename = "GautengForSale.csv"
f = open(filename, "w")
headers = "Description, Location, Price, Bedrooms, Bathrooms, Parking, FloorSize\n"
f.write(headers)


for page in range(1, 5):
    
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
    page_html = http.request('GET', url)
    page_soup = soup(page_html.data)
    containers = page_soup.findAll("div", {"class": "p24_content"})
    
    sleep(randint(2,10))
    
    for container in containers:
        
        description_container = container.findAll("div", {"class": "p24_description"})
        if not description_container:
            continue
        else:
            description = description_container[0].text
    
        location_container = container.findAll("span", {"class": "p24_location"})
        location = location_container[0].text
   
        price_container = container.findAll("div", {"class": "p24_price"})
        price = price_container[0].text.strip()
        
        bedrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container[0].text.strip()
        
        bathrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container[0].text.strip()
        
        parking_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container[0].text.strip()
        
        floor_size_container = container.findAll("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container[0].text.strip()

        print(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
        f.write(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")

f.close()

I'm not sure where I went wrong.

Upvotes: 1

Views: 98

Answers (2)

Andrej Kesely

Reputation: 195408

There are 2 problems:

1.) page_soup.findAll("div", {"class": "p24_content"}) should be page_soup.select(".p24_content"), because the site uses both <div> and <span> tags with this class.

2.) container.findAll("div", {"class": "p24_description"}) should be container.select_one(".p24_description, .p24_title"), because the class p24_description is only present on some pages.

import requests
from bs4 import BeautifulSoup


for page in range(1, 5):
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'

    page_soup = BeautifulSoup( requests.get(url).content, 'html.parser' )

    for container in page_soup.select(".p24_content"):
        description_container = container.select_one(".p24_description, .p24_title")
        if not description_container:
            continue
        else:
            description = description_container.get_text(strip=True)

        location_container = container.select_one(".p24_location")
        location = location_container.get_text(strip=True)

        price_container = container.select_one(".p24_price")
        price = price_container.text.strip()

        bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container.text.strip()

        bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container.text.strip()

        parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container.text.strip()

        floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container.text.strip()

        print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))

Prints:

5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²

... and so on.
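
The effect of the two selector changes can be checked offline. The following is only a minimal sketch: SAMPLE_HTML is hand-written markup made up for illustration (it is not copied from the real site), and it assumes the structure described above, where one listing uses <div> with p24_description and another uses <span> with p24_title.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration,
# not the real property24 page source).
SAMPLE_HTML = """
<div class="p24_content">
  <div class="p24_description">3 Bedroom House</div>
  <span class="p24_location">Glenvista</span>
</div>
<span class="p24_content">
  <span class="p24_title">2 Bedroom Apartment</span>
  <span class="p24_location">Fourways</span>
</span>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Restricting the search to <div> misses the <span> container.
div_only = page_soup.find_all("div", {"class": "p24_content"})

# A class-only CSS selector matches the class on any tag.
any_tag = page_soup.select(".p24_content")

# A comma-separated selector group matches either class name.
titles = []
for container in any_tag:
    node = container.select_one(".p24_description, .p24_title")
    titles.append(node.get_text(strip=True))

print(len(div_only), len(any_tag), titles)
```

With this sample, the <div>-only search finds one container while the class-only selector finds both, which matches the behaviour described in the question.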

Upvotes: 1

matheburg

Reputation: 2170

It looks like the p24_content class is applied to a span tag starting from the second page. A solution could be:

containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})

... if I read the bs4 documentation right.
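
As a quick check of that reading, here is a small sketch run against made-up markup (SAMPLE_HTML is an assumption for illustration, not the real page). Passing a list of tag names to find_all (the newer spelling of findAll) matches any of the listed tags, so both the <div> and the <span> are returned while other tags with the same class are skipped.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = (
    '<div class="p24_content">listing A</div>'
    '<span class="p24_content">listing B</span>'
    '<p class="p24_content">not a listing</p>'
)

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of names means "any of these tags"; the class filter still applies.
containers = page_soup.find_all(["div", "span"], {"class": "p24_content"})
tag_names = [container.name for container in containers]

print(tag_names)
```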

Maybe there are even more things that change. I didn't check :)

Upvotes: 2
