Reputation: 83
I want to scrape the number of bedrooms, the number of bathrooms, and the land area for each property, as separate values, from a real estate web page. However, all three values use the same tag, <strong>, with no class or id to tell them apart. So when I run the following code:
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")
# Match every <strong> tag that has neither a class nor an id
rooms = content.find_all('strong', class_=False, id=False)
for room in rooms:
    print(room.text)
I get the following:
Sign up
2
2
2
2
3
2
4
3
2.4ha
2
1
2
2
4
3
465m2
1
1
3
2
1
1
5
3
10.1ha
3
2
5
5
600m2
600m2
4
2
138m2
2
1
2
1
2
2
3
2
675m2
2
1
As you can see, the values all come out in one flat list because they share the same tag. How can I get the bedrooms, bathrooms, and land area for each property separately? Thanks!
Upvotes: 1
Views: 209
Reputation: 84455
I would loop over the main listing tiles and, within each tile, select each target node by something unique to it, e.g. the class of the icon that sits next to it in that tile's HTML. An if/else test for "is not None" supplies a default value where a field is missing. To cope with different sort orders, the date lookup is wrapped in a try/except. I went with sort by latest, but also tested with your sort order.
I added a few more fields to give context. It would be easy to extend this to loop over pages, but that is beyond the scope of your question and would be a candidate for a new question if you get stuck extending it.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
#'https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1'
r = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1',
headers = {'User-Agent':'Mozilla/5.0'}).text
soup = bs(r, 'lxml')
main_listings = soup.select('.listing-tile')
base = 'https://www.realestate.co.nz/4016546/residential/sale/'
results = {}
for listing in main_listings:
    # The listed-date markup varies with the sort order, so fall back to the whole node's text
    try:
        date = listing.select_one('.listed-date > span').next_sibling.strip()
    except (AttributeError, TypeError):
        date = listing.select_one('.listed-date').text.strip()
    title = listing.select_one('h3').text.strip()
    listing_id = listing.select_one('a')['id']
    url = base + listing_id
    # Each value is the <strong> that immediately follows its icon
    bedrooms = listing.select_one('.icon-bedroom + strong')
    if bedrooms is not None:
        bedrooms = int(bedrooms.text)
    else:
        bedrooms = np.nan
    bathrooms = listing.select_one('.icon-bathroom + strong')
    if bathrooms is not None:
        bathrooms = int(bathrooms.text)
    else:
        bathrooms = np.nan
    # Note the leading dot: icon-land-area is a class, not a tag name
    land_area = listing.select_one('.icon-land-area + strong')
    if land_area is not None:
        land_area = land_area.text
    else:
        land_area = "Not specified"
    price = listing.select_one('.text-right').text
    results[listing_id] = [date, title, url, bedrooms, bathrooms, land_area, price]
df = pd.DataFrame(results).T
df.columns = ['Listing Date', 'Title', 'Url', '#Bedroom', '#Bathrooms', 'Land Area', 'Price']
print(df)
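The per-tile selection can also be tried offline against a small hand-written snippet. The class names below are taken from the selectors in the code above, but the surrounding markup is a made-up stand-in for the real page, and text_or_default and parse_tiles are hypothetical helper names, so treat this as a sketch rather than the site's actual structure.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two listing tiles; the real page's markup may differ.
SAMPLE_HTML = """
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>4</strong>
  <span class="icon-bathroom"></span><strong>3</strong>
  <span class="icon-land-area"></span><strong>2.4ha</strong>
</div>
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>2</strong>
  <span class="icon-bathroom"></span><strong>1</strong>
</div>
"""


def text_or_default(tile, selector, default=None):
    """Return the stripped text of the first match inside tile, or default."""
    node = tile.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


def parse_tiles(html):
    """Build one dict per tile, so values from different listings never mix."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tile in soup.select(".listing-tile"):
        rows.append({
            # "icon + strong" matches the <strong> right after that icon
            "bedrooms": text_or_default(tile, ".icon-bedroom + strong"),
            "bathrooms": text_or_default(tile, ".icon-bathroom + strong"),
            "land_area": text_or_default(tile, ".icon-land-area + strong", "Not specified"),
        })
    return rows


rows = parse_tiles(SAMPLE_HTML)
for row in rows:
    print(row)
```

Scoping every select_one call to a single tile is what keeps the fields aligned: a tile without a land-area icon gets the default instead of borrowing the next tile's value.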
Upvotes: 1
Reputation: 3400
Find the main tile, i.e. the div tag that contains the information for one property. Some tiles are missing data such as the area or the number of bathrooms, so you can try this approach:
from bs4 import BeautifulSoup
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")
rooms = content.find_all('div', attrs={'data-test':"tile"})
for room in rooms:
    apart = room.find_all('strong', class_=False)
    # Start from defaults so a tile never reuses the previous tile's values
    dict1 = {'bedroom': "NA", 'bathroom': "NA", 'area': "NA"}
    if len(apart) == 3:
        # Assumes the order bedroom, bathroom, area
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
        dict1['area'] = apart[2].text
    elif len(apart) == 2:
        # Assumes the missing value is the area
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
    elif apart:
        # Assumes a lone value is the area
        dict1['area'] = apart[0].text
    print(dict1)
Output:
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '3', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '4', 'bathroom': '3', 'area': '2.4ha'}
{'bedroom': '2', 'bathroom': '1', 'area': 'NA'}
...
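Counting the <strong> tags cannot tell which value is missing when a tile has only one or two of them. One alternative, swapped in here for the positional approach, is to classify each value by its text. The sketch below assumes, based on the output in the question, that land areas always end in m2 or ha and that bedrooms are listed before bathrooms; split_values is a hypothetical helper name.

```python
import re

# Assumption: land areas look like "465m2" or "2.4ha", as in the question's output.
AREA_PATTERN = re.compile(r"^[\d.,]+\s*(?:m2|m²|ha)$", re.IGNORECASE)


def split_values(values):
    """Split one tile's <strong> texts into bedroom, bathroom and area fields."""
    result = {"bedroom": "NA", "bathroom": "NA", "area": "NA"}
    counts = []
    for value in values:
        value = value.strip()
        if AREA_PATTERN.match(value):
            result["area"] = value
        else:
            counts.append(value)
    # Assumption: when both counts are present, bedrooms come first.
    for key, value in zip(("bedroom", "bathroom"), counts):
        result[key] = value
    return result


print(split_values(["4", "3", "2.4ha"]))
print(split_values(["2", "1"]))
print(split_values(["600m2"]))
```

A tile with a single plain number is still ambiguous under this scheme (it is filed as a bedroom count), so selecting by the icon class remains the more reliable option when the markup provides one.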
Upvotes: 1