Reputation: 1
I'm building a web scraper for housing prices in the United States. An example of the data that I'm using can be found here. I'm trying to extract the data for the specific zip code (Studio: $1420, 1 Bedroom: $1560).
Here is the HTML portion of what I am trying to extract:
<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>
When I try to use BeautifulSoup4, this is what I have:
import urllib.request as urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price = soup.find('tspan', attrs={'class': 'highcharts-text-outline'})
print(price)
But this returns nothing. I am wondering how I can change my command to properly extract this.
Upvotes: 0
Views: 450
Reputation: 22440
You are trying to parse dynamically generated content using the urllib library, which only fetches the raw HTML and cannot run the JavaScript that draws the chart. You need a browser automation tool such as selenium to deal with that. Here is how you can do it using selenium:
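from contextlib import closing
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

with closing(Chrome()) as driver:
    quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'
    driver.get(quote_page)
    # find_element_by_class_name was removed in recent Selenium 4 releases;
    # find_element(By.CLASS_NAME, ...) works in both Selenium 3 and 4
    price = driver.find_element(By.CLASS_NAME, 'highcharts-text-outline').text
    print(price)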
Output:
$1420
Upvotes: 1
Reputation: 174
Try this:
price = soup.find('tspan',{'class':['highcharts-text-outline']})
price.text
Upvotes: 0
Reputation: 71451
You can use the text attribute:
from bs4 import BeautifulSoup as soup
s = '<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>'
result = soup(s, 'lxml').find('tspan').text
Output:
u'$1420'
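If you want both values from the question (Studio and 1 Bedroom) rather than just the first match, a minimal sketch along the same lines could use find_all. The html string and the extract_prices helper below are hypothetical stand-ins for illustration, not the real page markup. This only works on HTML that already contains the tspan elements, for example the rendered page source, and it uses the built-in 'html.parser' so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet standing in for the rendered chart markup.
html = (
    '<svg>'
    '<tspan class="highcharts-text-outline">$1420</tspan>'
    '<tspan class="highcharts-text-outline">$1560</tspan>'
    '</svg>'
)

def extract_prices(markup):
    """Return the text of every tspan carrying the chart's outline class."""
    soup = BeautifulSoup(markup, 'html.parser')
    matches = soup.find_all('tspan', class_='highcharts-text-outline')
    return [tag.get_text(strip=True) for tag in matches]

prices = extract_prices(html)
# Markup without any tspan gives an empty list instead of None,
# so there is no AttributeError from calling .text on None.
missing = extract_prices('<html><body></body></html>')

print(prices)
print(missing)
```

An empty list here is a hint that the HTML you parsed does not contain the chart labels, which is what happens with the raw urllib response in the question.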
Upvotes: 0