Reputation: 151
I'm trying to extract:
<div class="xl-surface-ch">
84 m² 2 bed.
</div>
from the linked page. The problem is that I only need the "84" from this string (the number is sometimes 2 or 3 digits, sometimes longer).
An added difficulty is that the square meters are sometimes not mentioned, which looks like this:
<div class="xl-surface-ch">
2 bed.
</div>
and in that case I'd need to return 0.
My best attempt is:
sqm = []
for item in soup.findAll('div', attrs={'class': 'xl-surface-ch'}):
    item = item.contents[0].strip()[0:4]
    item_clean = re.findall("[0-9]{2,4}", item)
    sqm.append(item_clean)
print(sqm)
But this doesn't give me the end result described above: each value comes back as a nested list, and a missing area gives an empty list instead of 0. Here's the result I'm getting with my code:
[['84'], ['70'], ['80'], ['32'], ['149'], ['22'], ['75'], ['30'], ['23'], ['104'], [], ['95'], ['129'], ['26'], ['55'], ['26'], ['25'], ['28'], ['33'], ['210'], ['37'], ['69'], ['36'], ['19'], ['119'], ['20'], ['20'], ['129'], ['154'], ['25']]
I'd be really interested in what kinds of solutions you come up with, because I'm honestly not sure there is one, especially since some buildings are listed without the sqm... maybe with an if statement? I'm going to try that right now anyhow.
Thank you in advance!
Upvotes: 0
Views: 247
Reputation: 11525
import requests
from bs4 import BeautifulSoup

r = requests.get(
    'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000')
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.findAll('div', attrs={'class': 'xl-surface-ch'}):
    item = item.text.strip()
    if 'm²' in item:
        # Keep only the text before the first 'm', e.g. "84 m² 2 bed." -> "84"
        print(item[0:item.find('m')].strip())
    else:
        # No surface area listed for this property
        item = 0
        print(item)
Output:
84
70
80
32
149
22
75
30
23
104
0
95
129
26
55
26
25
28
33
210
37
69
36
19
119
20
20
129
154
25
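If you want the values collected in a list of integers (as in the question's `sqm` list) rather than printed, a regex may be a bit tidier than slicing on the first 'm'. This is only a sketch: `extract_sqm` and the sample strings are hypothetical names and data I made up for illustration, and it assumes the area is a whole number followed by "m²".

```python
import re

def extract_sqm(text):
    """Return the number in front of 'm²' as an int, or 0 if there is none."""
    # Hypothetical helper: assumes a whole number, optional spaces, then 'm²'.
    match = re.search(r"(\d+)\s*m²", text)
    return int(match.group(1)) if match else 0

# Sample strings modelled on the two cases in the question.
samples = ["84 m² 2 bed.", "2 bed.", "149 m² 3 bed."]
sqm = [extract_sqm(s) for s in samples]
print(sqm)  # [84, 0, 149]
```

Inside the loop above you would then do something like `sqm.append(extract_sqm(item.text))` instead of printing. Anchoring the pattern on "m²" means the bedroom count is never picked up by mistake when the area is missing.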
Upvotes: 2