Ilovenoodles

Reputation: 83

Scraping elements with the same tag and without class and id attributes

I want to scrape the number of bedrooms, the number of bathrooms, and the land area for each property separately from a real estate webpage. However, all three values sit in identical <strong> tags that have no class or id attributes. So when I run the following code:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")

# Select every <strong> tag that has neither a class nor an id attribute
rooms = content.find_all('strong', class_=False, id=False)
for room in rooms:
    print(room.text)

I get the following:

Sign up
2
2
2
2
3
2
4
3
2.4ha
2
1
2
2
4
3
465m2
1
1
3
2
1
1
5
3
10.1ha
3
2
5
5
600m2
600m2
4
2
138m2
2
1
2
1
2
2
3
2
675m2
2
1

As you can see, the values all come back in one flat list because they share the same tag. How can I extract all of them, but grouped separately for each property? Thanks!

Upvotes: 1

Views: 209

Answers (2)

QHarr

Reputation: 84455

I would loop over the main listing tiles and, within each tile, try to select each target node, e.g. via a class that is unique within that tile's HTML. An if/else test against None lets you supply a default value where an item is missing. To handle different sort orders, the date lookup is also wrapped in a try/except. I went with sort by latest, but also tested with your sort order.

I added a few more fields to give context. It would be easy to extend this to loop over pages, but that is beyond the scope of your question and would be a candidate for a new question once you have tried extending it yourself, if required.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

#'https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1'

r = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1',
                  headers = {'User-Agent':'Mozilla/5.0'}).text
soup = bs(r, 'lxml')
main_listings = soup.select('.listing-tile')
base = 'https://www.realestate.co.nz/4016546/residential/sale/'
results = {}

for listing in main_listings:
    
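    try:
        # The date is normally the text node right after a <span> ...
        date = listing.select_one('.listed-date > span').next_sibling.strip()
    except (AttributeError, TypeError):
        # ... otherwise fall back to the full text of the date element
        date = listing.select_one('.listed-date').text.strip()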

    title = listing.select_one('h3').text.strip()
    listing_id = listing.select_one('a')['id']
    url = base + listing_id
    
    bedrooms = listing.select_one('.icon-bedroom + strong')
    
    if bedrooms is not None:
        bedrooms = int(bedrooms.text)
    else:
        bedrooms = np.nan
    
    bathrooms = listing.select_one('.icon-bathroom + strong')
    
    if bathrooms is not None:
        bathrooms = int(bathrooms.text)
    else:
        bathrooms = np.nan
    
    land_area = listing.select_one('.icon-land-area + strong')
    
    if land_area is not None:
        land_area = land_area.text
    else:
        land_area = "Not specified"
    
    price = listing.select_one('.text-right').text
    
    results[listing_id] = [date, title,  url, bedrooms, bathrooms, land_area, price]
    
df = pd.DataFrame(results).T
df.columns = ['Listing Date', 'Title', 'Url', '#Bedrooms', '#Bathrooms', 'Land Area', 'Price']
print(df)
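
The repeated select-then-check-for-None pattern above could be folded into one small helper. The following is only a sketch: text_or_default is a hypothetical name, and the inline HTML is made up for illustration. Its class names mirror the selectors used above, not necessarily the live page, whose markup may differ or change.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real page's markup may differ.
SAMPLE_HTML = """
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>3</strong>
  <span class="icon-bathroom"></span><strong>2</strong>
  <span class="icon-land-area"></span><strong>465m2</strong>
</div>
<div class="listing-tile">
  <span class="icon-land-area"></span><strong>2.4ha</strong>
</div>
"""


def text_or_default(tile, selector, default=None):
    """Return the stripped text of the first match for selector, else default."""
    node = tile.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for tile in soup.select(".listing-tile"):
    # "+ strong" picks the <strong> that directly follows the icon element,
    # so each value is identified by its icon rather than by its position.
    rows.append({
        "bedrooms": text_or_default(tile, ".icon-bedroom + strong"),
        "bathrooms": text_or_default(tile, ".icon-bathroom + strong"),
        "land_area": text_or_default(tile, ".icon-land-area + strong", "Not specified"),
    })

for row in rows:
    print(row)
```

Because every lookup is scoped to a single tile, a tile that lacks bedrooms or bathrooms simply gets the default instead of shifting the other values.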

Upvotes: 1

Bhavya Parikh

Reputation: 3400

Find the main tile for each property, i.e. the div tag that contains the property's information. Note that some tiles are missing data such as the area or the bathroom count, so you can try this approach:

from bs4 import BeautifulSoup
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")

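# Each property sits in its own <div data-test="tile">
rooms = content.find_all('div', attrs={'data-test': "tile"})
for room in rooms:
    # Positional guess: the <strong> tags are assumed to appear in the
    # order bedroom, bathroom, area
    apart = room.find_all('strong', class_=False)
    # Start from defaults so a tile never inherits values from the previous one
    dict1 = {'bedroom': "NA", 'bathroom': "NA", 'area': "NA"}
    if len(apart) == 3:
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
        dict1['area'] = apart[2].text
    elif len(apart) == 2:
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
    elif len(apart) == 1:
        # A single value is assumed to be the land area
        dict1['area'] = apart[0].text
    print(dict1)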

Output:

{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '3', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '4', 'bathroom': '3', 'area': '2.4ha'}
{'bedroom': '2', 'bathroom': '1', 'area': 'NA'}
...
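
The area values come back as text in mixed units (for example '2.4ha' and '465m2' in the question's output). If you need to compare or sort them numerically, one option is to normalise them to square metres. This is a minimal sketch, assuming only the two suffixes seen above occur; area_to_m2 is a hypothetical helper and not part of either answer.

```python
import re

# 1 hectare = 10,000 square metres.
_UNIT_TO_M2 = {"m2": 1.0, "ha": 10_000.0}
# Assumption: values look like "<number><unit>" with unit "m2" or "ha".
_AREA_RE = re.compile(r"^\s*([\d,]*\.?\d+)\s*(m2|ha)\s*$", re.IGNORECASE)


def area_to_m2(text):
    """Convert strings like '465m2' or '2.4ha' to square metres, else None."""
    match = _AREA_RE.match(text or "")
    if match is None:
        return None  # e.g. "NA" or "Not specified"
    number = float(match.group(1).replace(",", ""))
    return number * _UNIT_TO_M2[match.group(2).lower()]


for raw in ["465m2", "2.4ha", "10.1ha", "NA"]:
    print(raw, "->", area_to_m2(raw))
```

Anything that does not match the assumed pattern is returned as None rather than guessed at, so unexpected formats stay visible.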

Upvotes: 1
