Reputation: 3
I'm trying to get all the prices from rightmove.co.uk as a learning exercise to better understand web scraping.
Here's my code:
    import requests
    from bs4 import BeautifulSoup


    class RightmoveScraper:
        def fetch(self, url):
            response = requests.get(url)
            print('Status code : %s' % response.status_code)
            return response

        def parse(self, response):
            # 'response' is the HTML text of the page
            soup = BeautifulSoup(response, 'lxml')
            prices = [price.text for price in soup.find_all(
                'div', {'class': 'propertyCard-priceValue'})]
            print(prices)

        def run(self):
            response = self.fetch(
                'https://www.rightmove.co.uk/overseas-property-for-sale/Paris.html')
            self.parse(response.text)
When I run my scraper this is what prints out:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
instead of getting the prices.
Can someone guide me through what I am doing wrong and give me a solution?
Upvotes: 0
Views: 208
Reputation: 23815
The information you are looking for is in the source of the web page.
However, it is stored as a JavaScript data structure in the first script element of the body (XPath /html/body/script[1]).
That script assigns a JSON object to window.jsonModel, so all you have to do is cut the JSON text out of the script and load it into a Python dict with json.loads.
See https://pastebin.com/rzG9YL0y for the data.
Working code below:
    import json

    import requests

    r = requests.get('https://www.rightmove.co.uk/overseas-property-for-sale/Paris.html')
    if r.status_code == 200:
        search_term = '<script>window.jsonModel = '
        body = r.content.decode('utf-8')
        # locate the JSON text between the assignment and the closing script tag
        left_idx = body.find(search_term)
        right_idx = body.find('</script>', left_idx)
        offset = len(search_term)
        data_str = body[left_idx + offset:right_idx]
        # data holds the 'data model' of the page; the prices are in there as well
        data = json.loads(data_str)
        props = data['properties']
        for entry in props:
            _id = entry['id']
            price = entry['price']['amount']
            print('{} --> {}'.format(_id, price))
Output:
81919186 --> 899000
94229627 --> 1930000
94115438 --> 5300000
94115432 --> 1490000
91433144 --> 840000
90987107 --> 758000
90987110 --> 935000
90987101 --> 1630000
90987104 --> 3064000
90987092 --> 1274500
90987098 --> 1981000
90834383 --> 3344000
90834386 --> 1140000
90834392 --> 431000
90834368 --> 630000
90666347 --> 452000
88743806 --> 5194000
90665516 --> 6250000
90665774 --> 1795000
73687471 --> 1890000
90665348 --> 10500000
69017641 --> 930000
69017644 --> 930000
90665852 --> 1790000
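The string slicing above depends on the script tag containing nothing but the JSON after the assignment. As a slightly more defensive sketch of the same idea, json.JSONDecoder.raw_decode parses one JSON value and ignores whatever follows it. The helper name extract_json_model and the sample HTML are made up for illustration; only the window.jsonModel marker and the properties/id/price/amount keys come from the code above.

```python
import json

MARKER = 'window.jsonModel = '


def extract_json_model(html, marker=MARKER):
    """Return the JSON value assigned after `marker`, or None if it is absent."""
    start = html.find(marker)
    if start == -1:
        return None
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while idx < len(html) and html[idx].isspace():
        idx += 1
    # Parses exactly one JSON value; a trailing ';' or '</script>' is ignored.
    data, _end = json.JSONDecoder().raw_decode(html, idx)
    return data


# Made-up page fragment with the same shape as the real data model.
sample = ('<html><body><script>window.jsonModel = '
          '{"properties": [{"id": 1, "price": {"amount": 899000}}]};'
          '</script></body></html>')

model = extract_json_model(sample)
prices = {p['id']: p['price']['amount'] for p in model['properties']}
print(prices)
```

With the real page you would pass r.text instead of sample, and check for None in case the site changes its markup.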
Upvotes: 0
Reputation: 552
When you scrape a website, never rely on what your browser shows you (at least as far as the HTML elements are concerned). Browsers run JavaScript, which can populate HTML elements after the page has loaded.
If you write response.text to a file and take a quick look at it, you will see that the <div class="propertyCard-priceValue"> tag really is empty. The likely reason is that the prices are filled in at load time by the page's scripts.
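If you would rather check this from code than by eye, here is a minimal sketch using only the standard library's html.parser. The class and function names are my own, and the two HTML snippets are made-up stand-ins for "what requests receives" and "what the browser shows".

```python
from html.parser import HTMLParser


class PriceDivCollector(HTMLParser):
    """Collect the text inside every <div> carrying a given class."""

    def __init__(self, css_class='propertyCard-priceValue'):
        super().__init__()
        self.css_class = css_class
        self.depth = 0   # > 0 while we are inside a matching div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth:
            self.depth += 1          # nested div inside a matching one
        elif self.css_class in classes:
            self.depth = 1
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def price_texts(html):
    collector = PriceDivCollector()
    collector.feed(html)
    return [text.strip() for text in collector.texts]


# Made-up examples: raw HTML as served vs. HTML after scripts have run.
raw = ('<div class="propertyCard-priceValue"></div>\n'
       '<div class="propertyCard-priceValue"> </div>')
rendered = '<div class="propertyCard-priceValue"><span>£899,000</span></div>'

print(price_texts(raw))
print(price_texts(rendered))
```

Running price_texts(response.text) on the real page should give you the same list of empty strings that BeautifulSoup printed, which confirms the divs are empty in the HTML the server sends.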
Unfortunately, the general-purpose solution to this kind of problem is to run a browser from your Python code. I suggest you take a look at how Selenium works.
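As a rough sketch of what that could look like with Selenium 4, assuming Chrome is installed locally and that the rendered prices end up in the div.propertyCard-priceValue elements from the question. The function names and the 10 second timeout are my own choices, and I have not run this against the live site.

```python
def non_empty(texts):
    """Strip each string and drop the ones that end up empty."""
    return [t.strip() for t in texts if t.strip()]


def fetch_rendered_prices(url, selector='div.propertyCard-priceValue', timeout=10):
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)

        def prices_ready(d):
            # An empty list is falsy, so the wait keeps polling until
            # at least one price element contains text.
            elements = d.find_elements(By.CSS_SELECTOR, selector)
            return non_empty(e.text for e in elements)

        # Raises TimeoutException if no price text appears in time.
        return WebDriverWait(driver, timeout).until(prices_ready)
    finally:
        driver.quit()
```

You would call it as fetch_rendered_prices('https://www.rightmove.co.uk/overseas-property-for-sale/Paris.html'). The explicit wait matters because the elements exist before the scripts fill them in, so reading them immediately after driver.get can still return empty strings.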
Upvotes: 1