Reputation: 25
I'm trying to extract data from the internet. My code gets through the first iteration of the page loop smoothly, printing the data and writing it to the file, but it won't print any data for the following pages. Note: I am using a Python 3 notebook. Here is my Python code:
import urllib3
from bs4 import BeautifulSoup as soup
from time import sleep
from random import randint
import pandas as pd

http = urllib3.PoolManager()

filename = "GautengForSale.csv"
f = open(filename, "w")
headers = "Description, Location, Price, Bedrooms, Bathrooms, Parking, FloorSize\n"
f.write(headers)

for page in range(1, 5):
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
    page_html = http.request('GET', url)
    page_soup = soup(page_html.data)
    containers = page_soup.findAll("div", {"class": "p24_content"})
    sleep(randint(2,10))
    for container in containers:
        description_container = container.findAll("div", {"class": "p24_description"})
        if not description_container:
            continue
        else:
            description = description_container[0].text
        location_container = container.findAll("span", {"class": "p24_location"})
        location = location_container[0].text
        price_container = container.findAll("div", {"class": "p24_price"})
        price = price_container[0].text.strip()
        bedrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container[0].text.strip()
        bathrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container[0].text.strip()
        parking_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container[0].text.strip()
        floor_size_container = container.findAll("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container[0].text.strip()
        print(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
        f.write(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")

f.close()
I'm not sure where I went wrong.
Upvotes: 1
Views: 98
Reputation: 195408
There are 2 problems:
1.) page_soup.findAll("div", {"class": "p24_content"}) should be page_soup.select(".p24_content"), because the site puts this class on <div> tags on some pages and on <span> tags on others.
2.) container.findAll("div", {"class": "p24_description"}) should be container.select_one(".p24_description, .p24_title"), because the class p24_description is only present on some pages.
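Both points can be checked offline with a small sketch. The two HTML snippets below are made-up stand-ins for the two page layouts (they are not copied from the site), and `descriptions_old` / `descriptions_new` are hypothetical helper names used only for this illustration. `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two layouts; not taken from the real site.
PAGE_ONE = """
<div class="p24_content">
  <div class="p24_description">3 Bedroom House</div>
</div>
"""
PAGE_TWO = """
<span class="p24_content">
  <span class="p24_title">2 Bedroom Apartment</span>
</span>
"""


def descriptions_old(html):
    """Lookup as in the question: only <div> containers and <div> descriptions."""
    page_soup = BeautifulSoup(html, "html.parser")
    found = []
    for container in page_soup.find_all("div", {"class": "p24_content"}):
        hits = container.find_all("div", {"class": "p24_description"})
        if hits:
            found.append(hits[0].text.strip())
    return found


def descriptions_new(html):
    """Lookup by class only, accepting either description class."""
    page_soup = BeautifulSoup(html, "html.parser")
    found = []
    for container in page_soup.select(".p24_content"):
        hit = container.select_one(".p24_description, .p24_title")
        if hit:
            found.append(hit.get_text(strip=True))
    return found


for name, html in (("page 1", PAGE_ONE), ("page 2", PAGE_TWO)):
    print(name, descriptions_old(html), descriptions_new(html))
```

With this sample markup the tag-specific lookup finds nothing in the second snippet, while the class-only selectors find the listing in both.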
import requests
from bs4 import BeautifulSoup

for page in range(1, 5):
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
    page_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for container in page_soup.select(".p24_content"):
        description_container = container.select_one(".p24_description, .p24_title")
        if not description_container:
            continue
        else:
            description = description_container.get_text(strip=True)
        location_container = container.select_one(".p24_location")
        location = location_container.get_text(strip=True)
        price_container = container.select_one(".p24_price")
        price = price_container.text.strip()
        bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container.text.strip()
        bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container.text.strip()
        parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container.text.strip()
        floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container.text.strip()
        print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))
Prints:
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²
... and so on.
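The code above only prints the rows, whereas the question also wrote them to GautengForSale.csv by joining the fields with commas. One possible way to write the file is the standard library's `csv` module, which quotes any field that itself contains a comma. This is only a sketch: `write_rows` is a hypothetical helper, the first sample row is copied from the output above, and the second row is invented to show the quoting.

```python
import csv
import io

HEADERS = ["Description", "Location", "Price", "Bedrooms",
           "Bathrooms", "Parking", "FloorSize"]


def write_rows(rows, fileobj):
    """Write the header line and all listing rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADERS)
    writer.writerows(rows)


rows = [
    # Copied from the printed output above.
    ("5 Bedroom Townhouse inFourways", "Fourways", "R 5 890 000", "5", "5.5", "2", "457 m²"),
    # Invented row: the description contains a comma, so csv quotes it.
    ("House, with garden", "Geduld", "R 750 000", 0, 1, 0, "n/a"),
]

# An in-memory buffer keeps the sketch side-effect free.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())

# For a real file, the same helper could be used like this:
# with open("GautengForSale.csv", "w", newline="", encoding="utf-8") as f:
#     write_rows(rows, f)
```

In the scraping loop, each listing would be appended to `rows` as a tuple instead of being printed, and the file would be written once at the end.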
Upvotes: 1
Reputation: 2170
It looks like the p24_content class is applied to a span tag starting from the second page. A solution could be:
containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})
... if I read the bs4 documentation right.
Maybe there are even more things that change. I didn't check :)
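A quick offline check of the list form, using made-up markup rather than the real page (the third element is an unrelated tag added only to show that it is not matched):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <div> container, one <span> container,
# and a <p> with the same class that should not be picked up.
HTML = """
<div class="p24_content">first listing</div>
<span class="p24_content">second listing</span>
<p class="p24_content">not a listing container</p>
"""

page_soup = BeautifulSoup(HTML, "html.parser")

# Original lookup: <div> tags only.
div_only = page_soup.find_all("div", {"class": "p24_content"})

# Passing a list of tag names matches any of them.
div_or_span = page_soup.find_all(["div", "span"], {"class": "p24_content"})

print([tag.name for tag in div_only])     # ['div']
print([tag.name for tag in div_or_span])  # ['div', 'span']
```

Unlike a class-only selector, this form still restricts the match to the listed tag names.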
Upvotes: 2