Alan Tang

Reputation: 1

Using Beautiful Soup to Extract Nested Data in HTML

I'm building a web scraper for housing prices in the United States. An example of the data that I'm using can be found here. I'm trying to extract the data for the specific zip code (Studio: $1420, 1 Bedroom: $1560).

Here is the HTML portion of what I am trying to extract:

<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>

When I try to use BeautifulSoup4, this is what I have:

import urllib.request as urllib2
from bs4 import BeautifulSoup

# specify the url
quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'

# query the website and return the html to the variable ‘page’
page = urllib2.urlopen(quote_page)


soup = BeautifulSoup(page, 'html.parser')
price = soup.find('tspan', attrs={'class': 'highcharts-text-outline'})

print(price)

But this returns nothing. I am wondering how I can change my command to properly extract this.

Upvotes: 0

Views: 450

Answers (3)

SIM

Reputation: 22440

You are trying to parse dynamic content with urllib, which cannot do the job: it only downloads the raw HTML and does not run the JavaScript that draws the chart. You need a browser automation tool such as Selenium to deal with that. Here is how you can do it with Selenium:

from contextlib import closing

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

with closing(Chrome()) as driver:
    quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'
    driver.get(quote_page)
    # find_element_by_class_name was removed in Selenium 4; use find_element with By.CLASS_NAME
    price = driver.find_element(By.CLASS_NAME, 'highcharts-text-outline').text
    print(price)

Output:

$1420
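If you would rather keep BeautifulSoup for the parsing step, one option is to hand it the rendered HTML (in Selenium, `driver.page_source`) and collect every matching `tspan`. The sketch below is only an illustration: `rendered_html` is a hypothetical stand-in for the rendered page, and the two-element structure is an assumption, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source after the chart has rendered.
rendered_html = """
<svg>
  <text><tspan class="highcharts-text-outline">$1420</tspan></text>
  <text><tspan class="highcharts-text-outline">$1560</tspan></text>
</svg>
"""


def extract_prices(html):
    """Return the text of every tspan carrying the outline class."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("tspan", class_="highcharts-text-outline")
    return [tag.get_text(strip=True) for tag in tags]


prices = extract_prices(rendered_html)
print(prices)
```

Using `find_all` instead of `find` returns all matches, which is closer to what the question asks for (both the Studio and the 1 Bedroom price).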

Upvotes: 1

Manish Mahendru

Reputation: 174

Try this:

price = soup.find('tspan',{'class':['highcharts-text-outline']})

price.text
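Note that `soup.find` returns `None` when nothing matches, so `price.text` raises an `AttributeError` if the element is not in the HTML you downloaded. A minimal sketch of a guard follows; `find_price` is a hypothetical helper and both HTML strings are made-up samples, not the real page.

```python
from bs4 import BeautifulSoup


def find_price(html):
    """Return the first outline tspan's text, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("tspan", {"class": "highcharts-text-outline"})
    return tag.text if tag is not None else None


# Made-up sample: raw HTML before any JavaScript has drawn the chart.
static_html = "<html><body><div id='chart'></div></body></html>"
# Made-up sample: the element as it looks once the chart is rendered.
rendered_html = '<tspan class="highcharts-text-outline">$1420</tspan>'

missing = find_price(static_html)
found = find_price(rendered_html)
print(missing, found)
```

If the helper returns `None` for the page fetched with urllib, the selector is probably fine and the element simply is not in the raw response.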

Upvotes: 0

Ajax1234

Reputation: 71451

You can use the text attribute:

from bs4 import BeautifulSoup as soup
s = '<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>'
result = soup(s, 'lxml').find('tspan').text

Output:

u'$1420'
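On Python 3 the same value displays as `'$1420'`, without the `u` prefix. If you need a number rather than a string, a small sketch of the conversion could look like this; it uses `html.parser` so that lxml is not required, and it assumes the price is a whole-dollar amount with an optional thousands separator.

```python
from bs4 import BeautifulSoup

s = ('<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" '
     'stroke="#000000" stroke-width="2px" stroke-linejoin="round" '
     'style="">$1420</tspan>')

text = BeautifulSoup(s, "html.parser").find("tspan").text

# Assumption: whole-dollar amounts such as "$1420" or "$1,420".
amount = int(text.lstrip("$").replace(",", ""))
print(text, amount)
```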

Upvotes: 0
