Edison

Reputation: 11987

regex with bs4 is splitting the results

My regex returns the match split across several list elements, so as a quick fix I index the first element.

Code

import re

my_url = 'https://www.zoopla.co.uk/for-sale/property/b23/?page_size=100&q=B23&radius=0&results_sort=newest_listings&search_source=refine'

# page_soup is the BeautifulSoup object parsed from my_url (fetching code not shown)
house_listings = page_soup.find_all("div", {"class": "listing-results-right clearfix"})

listings = house_listings[3]  # item 3 for prototyping

# .text is already a str, so no str() wrapper is needed
house_type = re.findall(r'(?:(?!.for).)*', listings.h2.a.text)

print(house_type)
# ['4 bed detached house', '', 'for sale', '']

Fix

house_type = re.findall(r'(?:(?!.for).)*', listings.h2.a.text)[0]
print(house_type)
# 4 bed detached house

But beyond that, I need a new regex for better matching.

Desired Match
Start from the word after 'bed' (dropping the space that follows it) and ignore the 'for sale' portion.
Example results: detached house, terrace house, semi-detached house, flat, maisonette.

Source https://www.zoopla.co.uk/for-sale/property/b23/?page_size=100&q=B23&radius=0&results_sort=newest_listings&search_source=refine

Upvotes: 0

Views: 72

Answers (1)

jdaz

Reputation: 6053

This should be all you need:

(?<=bed ).*(?= for)
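As a rough sketch of how that pattern might be used in Python: the sample titles below are made up to mirror the format in the question, and `extract_house_type` is a hypothetical helper name, not something from the original code.

```python
import re

# The lookbehind starts the match right after "bed ";
# the lookahead stops it just before " for".
HOUSE_TYPE = re.compile(r'(?<=bed ).*(?= for)')


def extract_house_type(title):
    """Return the property type from a listing title, or None if there is no match."""
    match = HOUSE_TYPE.search(title)
    return match.group(0) if match else None


# Hypothetical titles in the same shape as the one in the question.
titles = [
    "4 bed detached house for sale",
    "3 bed semi-detached house for sale",
    "2 bed flat for sale",
]

for title in titles:
    print(extract_house_type(title))
```

Because `re.search` returns a single match object rather than a list, there is no need to subscript the result. A title that does not contain "bed " would return None here, so that case would need handling separately.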


Upvotes: 1
