Reputation: 41
I am trying to extract the text "This station managed by the Delta Flow Projects Office" from this website: https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001. This line is located under the div class stationContainer. Since this is a dynamic webpage, I'm using selenium to derive the html.
This is the html from the website.
This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
url = "https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001"
browser.get(url) #navigate to the page
innerHTML = browser.execute_script("return document.body.innerHTML")
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")
print (elem)
I get this result from my print statement:
selenium.webdriver.remote.webelement.WebElement (session="96fc124c0e2d1fd4cd86f61db272d52a", element="0.5862443940581294-1")
I'm hoping to derive the text by searching through the div class, but it seems I'm not going about this the right way.
Upvotes: 3
Views: 462
Reputation: 12972
Well, the content you want to scrape is not actually dynamic. You can use bs4 to fetch the content of the div class stationContainer. What makes this a bit challenging is that the text you're looking for is not wrapped in a tag of its own. One solution is a simple string manipulation that extracts the content between the </form> tag and the <br/><br/> tags, like so:
from bs4 import BeautifulSoup
from requests import get
soup = BeautifulSoup(get('https://your_url_here').text, "html.parser")
for i in soup.find_all('div', attrs={'class': "stationContainer"}):
    # Keep only what sits between the closing </form> and the next <br/><br/>
    print(str(i).split('</form>')[1].split('<br/><br/>')[0].strip())
This code produces the appropriate result!
Upvotes: 0
Reputation: 642
print (elem.text)
elem is a WebElement object, hence the printed message. If you want to access the text, you need to add .text to the end, or if you want to grab some other attribute you can do something like elem.get_attribute('innerHTML').
Also, since the div element has a lot of other text, you're going to get a lot more text than you want. I haven't looked into other similar pages, but perhaps you could extract what's between </form> and <br><br> in the div's html.
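The idea of extracting what sits between </form> and the double line break could be sketched with a regular expression applied to the div's html. This is only a sketch under assumptions: the sample markup is made up for illustration (the real page's html is not reproduced here), and extract_station_note is a hypothetical helper name. The pattern accepts both <br> and <br/>, since a browser's innerHTML and other serializers may write the tag differently.

```python
import re

# Text between the closing </form> and the first pair of consecutive <br> tags.
# Accepts <br>, <br/> and <br /> so the same pattern fits different serializers.
_NOTE_RE = re.compile(
    r"</form>(.*?)<br\s*/?>\s*<br\s*/?>",
    re.IGNORECASE | re.DOTALL,
)


def extract_station_note(div_html):
    """Return the text after </form> and before <br><br>, or None if absent."""
    match = _NOTE_RE.search(div_html)
    return match.group(1).strip() if match else None


# Hypothetical markup, assumed to resemble the stationContainer div.
sample_html = (
    '<div class="stationContainer"><form action="#"><input name="x"></form>\n'
    "This station managed by the Delta Flow Projects Office<br><br>\n"
    "Other station details</div>"
)

print(extract_station_note(sample_html))
```

With Selenium, the input string could be elem.get_attribute('innerHTML') for the stationContainer element. Returning None when the pattern is missing makes it easier to notice pages that are laid out differently.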
Upvotes: 1
Reputation: 1314
elem is a list, not a string. Try this:
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")[0]
print(elem.text)
That prints out all the content. So you probably need a better selector or a way to parse the rest of it out.
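One possible way to parse the rest out, sketched under assumptions: take the element's rendered text and keep only the line that contains a known phrase. The helper name find_line and the sample text are hypothetical, and this assumes the sentence lands on its own line in elem.text, which I have not verified against the live page.

```python
def find_line(text, keyword):
    """Return the first line of text that contains keyword (stripped), else None."""
    for line in text.splitlines():
        if keyword in line:
            return line.strip()
    return None


# Hypothetical stand-in for elem.text; the real div contains more lines.
sample_text = (
    "Available data for this site\n"
    "This station managed by the Delta Flow Projects Office\n"
    "Available Parameters\n"
)

print(find_line(sample_text, "This station managed by"))

# With a live browser the text would come from the element. Selenium 4
# replaces the find_element(s)_by_* helpers with find_element(By..., ...):
#   from selenium.webdriver.common.by import By
#   elem = browser.find_element(By.XPATH, "//div[@class='stationContainer']")
#   print(find_line(elem.text, "This station managed by"))
```

Matching on a phrase is fragile if the wording differs between stations, so a shorter keyword such as "managed by" may generalize better.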
Upvotes: 1