saoirse

Reputation: 41

Extracting text from a JavaScript webpage using Selenium

I am trying to extract the text "This station managed by the Delta Flow Projects Office" from this website: https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001. This line is located inside the div with class stationContainer. Since this is a dynamic webpage, I'm using Selenium to retrieve the HTML.

This is the HTML from the website:

[image: screenshot of the page's HTML]

This is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
url = "https://waterdata.usgs.gov/ca/nwis/uv?site_no=381504121404001"
browser.get(url) #navigate to the page
innerHTML = browser.execute_script("return document.body.innerHTML")
elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")

print (elem)

I get this result from my print statement:

selenium.webdriver.remote.webelement.WebElement (session="96fc124c0e2d1fd4cd86f61db272d52a", element="0.5862443940581294-1")

I'm hoping to derive the text by searching through the div class, but it seems I'm not going about this the right way.

Upvotes: 3

Views: 462

Answers (3)

coder

Reputation: 12972

Well, the content you want to scrape is not actually dynamic, so you can use bs4 to fetch the contents of the stationContainer div. What makes this a bit challenging is that the text you're looking for is not wrapped in its own tag. One solution is simple string manipulation that extracts the content between the </form> tag and the <br/><br/> tags, like so:

from bs4 import BeautifulSoup
from requests import get

soup = BeautifulSoup(get('https://your_url_here').text, "html.parser")

for i in soup.find_all('div', attrs={'class':"stationContainer"}):
    # Keep only the text that sits between </form> and <br/><br/>
    print(str(i).split('</form>')[1].split('<br/><br/>')[0].strip())

This code produces the appropriate result!

Upvotes: 0

Dean W.

Reputation: 642

print (elem.text)

elem is a WebElement object, hence the printed message. To access the text, add .text to the end. To grab some other attribute, you can do something like elem.get_attribute('innerHTML').

Also, since the div element has a lot of other text, you're going to be getting a lot more text than what you want. I haven't looked into other similar pages, but perhaps you could extract what's between </form> and <br><br> in the div's html.
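The extraction suggested above could be sketched roughly as follows. This is only an illustration: text_between is a hypothetical helper, and sample_html is a made-up stand-in for whatever elem.get_attribute('innerHTML') returns on the real page, whose exact markup may differ.

```python
def text_between(html, start_marker, end_marker):
    """Return the stripped text between the first start_marker and the
    next end_marker, or None if either marker is missing."""
    _, found_start, rest = html.partition(start_marker)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end_marker)
    if not found_end:
        return None
    return middle.strip()


# Hypothetical stand-in for elem.get_attribute('innerHTML');
# the real page's markup is an assumption here.
sample_html = (
    "<form action='/nwis/uv'><input type='submit'></form>\n"
    "This station managed by the Delta Flow Projects Office\n"
    "<br><br><table><tr><td>other content</td></tr></table>"
)

station_note = text_between(sample_html, "</form>", "<br><br>")
print(station_note)
```

Returning None when a marker is missing avoids the IndexError that chained split(...)[1] calls would raise on a page with a different layout.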

Upvotes: 1

JavaKungFu

Reputation: 1314

elem is a list, not a string. Try this:

elem = browser.find_elements_by_xpath("//div[@class='stationContainer']")[0]
print(elem.text)

That prints out all the content. So you probably need a better selector or a way to parse the rest of it out.
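One way to parse the rest out, sketched under assumptions: if the wanted sentence sits on its own line in elem.text, filtering lines by a known prefix may be enough. lines_starting_with is a hypothetical helper, and sample_text is invented for illustration rather than copied from the page.

```python
def lines_starting_with(text, prefix):
    """Return every line of text (stripped) that begins with prefix."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip().startswith(prefix)
    ]


# Hypothetical stand-in for elem.text; the real element contains other content.
sample_text = """Some station heading
Available data for this site
This station managed by the Delta Flow Projects Office
More text that is not wanted
"""

matches = lines_starting_with(sample_text, "This station managed by")
print(matches[0] if matches else "not found")
```

This only works when the prefix is known in advance and the sentence is not merged with neighbouring text on the same line.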

Upvotes: 1
