iiioan

Reputation: 3

BeautifulSoup find_all returns list of empty strings

I'm trying to get all the prices from rightmove.co.uk as a learning exercise to better understand web scraping.

Here's my code:

import requests
from bs4 import BeautifulSoup


class RightmoveScraper:
    def fetch(self, url):
        response = requests.get(url)
        print('Status code : %s' % response.status_code)
        return response

    def parse(self, response):
        soup = BeautifulSoup(response, 'lxml')
        # Collect the text of every price element found in the HTML
        prices = [price.text for price in soup.find_all(
            'div', {'class': 'propertyCard-priceValue'})]
        print(prices)

    def run(self):
        response = self.fetch(
            'https://www.rightmove.co.uk/overseas-property-for-sale/Paris.html')
        self.parse(response.text)

When I run my scraper this is what prints out:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

instead of getting the prices.

Can someone guide me through what I am doing wrong and give me a solution?

Upvotes: 0

Views: 208

Answers (2)

balderman

Reputation: 23815

The information you are looking for is in the source of the web page.

However, it is stored as a JavaScript data structure in the first script element of the body (XPath /html/body/script[1]).

All you have to do is read the content of that script, which assigns a JSON object to window.jsonModel, and load the JSON into a Python dict.

See https://pastebin.com/rzG9YL0y for the data.

Working code below:

import json
import requests

r = requests.get('https://www.rightmove.co.uk/overseas-property-for-sale/Paris.html')
if r.status_code == 200:
    search_term = '<script>window.jsonModel = '
    body = r.content.decode('utf-8')
    left_idx = body.find(search_term)
    right_idx = body.find('</script>', left_idx)
    offset = len(search_term)
    data_str = body[left_idx + offset:right_idx]
    # data holds the page's data model, which includes the prices
    data = json.loads(data_str)
    props = data['properties']
    for entry in props:
        _id = entry['id']
        price = entry['price']['amount']
        print('{} --> {}'.format(_id, price))

Output:

81919186 --> 899000
94229627 --> 1930000
94115438 --> 5300000
94115432 --> 1490000
91433144 --> 840000
90987107 --> 758000
90987110 --> 935000
90987101 --> 1630000
90987104 --> 3064000
90987092 --> 1274500
90987098 --> 1981000
90834383 --> 3344000
90834386 --> 1140000
90834392 --> 431000
90834368 --> 630000
90666347 --> 452000
88743806 --> 5194000
90665516 --> 6250000
90665774 --> 1795000
73687471 --> 1890000
90665348 --> 10500000
69017641 --> 930000
69017644 --> 930000
90665852 --> 1790000
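
The same extraction can be wrapped in a small function that fails loudly when the marker is missing. This is only a sketch under stated assumptions: the name extract_json_model and the sample HTML are made up for illustration, and the marker string is the one searched for in the code above, which the site may change at any time. It uses json.JSONDecoder.raw_decode so that anything following the JSON object inside the script (such as a trailing semicolon) is ignored.

```python
import json

# Marker taken from the search term used above; an assumption about the page.
MARKER = 'window.jsonModel = '


def extract_json_model(html, marker=MARKER):
    """Return the JSON value assigned right after `marker` in `html`."""
    start = html.find(marker)
    if start == -1:
        raise ValueError('marker %r not found in page source' % marker)
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it here
    while idx < len(html) and html[idx].isspace():
        idx += 1
    # Parse exactly one JSON value starting at idx; trailing text is ignored
    data, _end = json.JSONDecoder().raw_decode(html, idx)
    return data


# Made-up sample standing in for the real page source
SAMPLE_HTML = (
    '<html><body><script>window.jsonModel = '
    '{"properties": [{"id": 1, "price": {"amount": 899000}}, '
    '{"id": 2, "price": {"amount": 1930000}}]};</script></body></html>'
)

model = extract_json_model(SAMPLE_HTML)
prices = {entry['id']: entry['price']['amount'] for entry in model['properties']}
print(prices)  # {1: 899000, 2: 1930000}
```

With the real page you would pass r.text (or the decoded r.content) instead of SAMPLE_HTML.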

Upvotes: 0

DanBrezeanu

Reputation: 552

When you scrape a website, do not rely on what your browser shows you, at least as far as the HTML elements are concerned. Browsers run JavaScript, which can populate HTML elements after the page has loaded.

If you write response.text to a file and take a quick look at it, you will see that the <div class="propertyCard-priceValue"> tag really is empty. The likely reason is that the prices are filled in at load time by the page's scripts.
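
One way to check this without reading the file by hand is to count how many of the target elements actually contain text in the raw HTML. The sketch below uses only the standard library's html.parser and a made-up HTML sample. The class name PriceDivCollector is hypothetical, and the code assumes the price divs contain no nested div elements.

```python
from html.parser import HTMLParser


class PriceDivCollector(HTMLParser):
    """Collect the text inside every <div> carrying the target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.texts = []       # one entry per matching div
        self._inside = False  # True while inside a matching div

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and self.target_class in classes:
            self._inside = True
            self.texts.append('')

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        # Assumption: no nested divs inside a price div
        if tag == 'div' and self._inside:
            self._inside = False


# Made-up sample imitating what the server sends before any script runs
STATIC_HTML = (
    '<div class="propertyCard-priceValue"></div>'
    '<div class="propertyCard-priceValue"></div>'
    '<div class="other">unrelated</div>'
)

collector = PriceDivCollector('propertyCard-priceValue')
collector.feed(STATIC_HTML)
collector.close()
empty = [text for text in collector.texts if not text.strip()]
print('%d of %d price divs are empty' % (len(empty), len(collector.texts)))
```

If every matching div comes back empty for the real response.text, the values are not in the static markup and have to come from somewhere else in the page.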

A general solution to this kind of problem is to run a browser from your Python code. I suggest you take a look at how Selenium works.
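
As a rough illustration of the browser approach, here is a minimal sketch using Selenium 4. It assumes Selenium is installed and that Chrome is available on the machine. The function name fetch_rendered_prices, the CSS selector, and the timeout are illustrative choices rather than anything the site guarantees.

```python
def fetch_rendered_prices(url, css_selector='div.propertyCard-priceValue', timeout=10):
    """Load `url` in a real browser and return the text of matching elements."""
    # Imported here so the function can be defined without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Wait until at least one matching element has non-empty text;
        # raises TimeoutException if that never happens within `timeout` seconds
        WebDriverWait(driver, timeout).until(
            lambda d: any(
                element.text.strip()
                for element in d.find_elements(By.CSS_SELECTOR, css_selector)
            )
        )
        return [
            element.text
            for element in driver.find_elements(By.CSS_SELECTOR, css_selector)
        ]
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()
```

Calling fetch_rendered_prices with the search page URL opens a browser window, lets the page's scripts run, and only then reads the elements, which is the step plain requests cannot do.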

Upvotes: 1
