user0978189

Reputation: 31

How to scrape hidden text from a web page?

I am trying to scrape some text from a web page. The page shows a list of words. Some of them are visible; the others become visible only when I click on "+ More". Once expanded, the list is always the same (same words, same order). However, some of the words are in bold and some are struck through (deleted). So basically each item in the database has some features, and what I would like to do is find out, for each item, which features are available and which are not. My problem is getting past the "+ More" button.

My script works fine only for the words that are shown, not for those hidden behind "+ More". What I am trying to do is collect all the words that sit inside a "del" node. I initially thought that lxml would load the web page as it appears in Chrome's Inspect Element, and I wrote my code accordingly:

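from lxml import html

# br is assumed to be a browser object created earlier (e.g. mechanize.Browser()); not shown here
tree = html.fromstring(br.open(current_url).get_data())

mydata = {}

# Double quotes inside the XPath so they do not end the Python string early
if len(tree.xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'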

Every time I run this code, I collect only the part of the data that is initially shown on the web page, not the complete list of words that would be shown after clicking "+ More".

I have also tried Selenium, but as far as I understand it is meant for interacting with the web page rather than for parsing. However, if I run this:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.mywebpage.co.uk')

a = driver.find_element_by_xpath('//del[text()="some text"]')

I either get the element or an error. I would like to get an empty list so I could do:

mydata = {}

if len(driver.find_element_by_xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'

or find another way to get these "hidden" elements captured by the script.

My question is: has anyone had this type of problem? How did you solve it?

Upvotes: 3

Views: 2159

Answers (1)

RemcoW

Reputation: 4336

If I understand correctly, you want the result of the lookup as a list. However, Selenium's find_element_by_xpath raises a NoSuchElementException if the element is not present on the page, instead of returning an empty list.

The question I have is why do you want a list? Judging by your example you want to see if an element is present on the page or not. You can easily achieve this by using a try/except.

from selenium.common.exceptions import NoSuchElementException

try:
    # Raises NoSuchElementException when no matching <del> element exists
    driver.find_element_by_xpath('//del[text()="some text"]')
    mydata['some text'] = 'text is deleted from the web page!'
except NoSuchElementException:
    mydata['some text'] = 'text is not deleted'

Now if you really really need this list you could search the page for multiple elements. This will return all the elements that match the locator in a list. To do this replace:

driver.find_element_by_xpath('//del[text()="some text"]')

With (note the plural "elements"):

driver.find_elements_by_xpath('//del[text()="some text"]')
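
To show how the empty-list behaviour fits the dictionary pattern from the question, here is a minimal sketch. The names deleted_status and FakeDriver are hypothetical and only there for illustration; FakeDriver stands in for a real WebDriver so the snippet runs without a browser. It uses the find_elements(by, value) form, since recent Selenium 4 releases removed the find_elements_by_xpath helpers, and "xpath" is the string value of selenium's By.XPATH.

```python
# Sketch only: deleted_status and FakeDriver are hypothetical names.

def deleted_status(driver, words):
    """Map each word to a message saying whether it appears inside a <del> tag.

    driver is anything with a Selenium-style find_elements(by, value) method.
    The string "xpath" is the value of selenium's By.XPATH constant.
    """
    status = {}
    for word in words:
        # Assumes word contains no double quotes; they would break the XPath literal.
        matches = driver.find_elements("xpath", f'//del[text()="{word}"]')
        # find_elements returns [] when nothing matches, so no try/except is needed.
        if matches:
            status[word] = 'text is deleted from the web page!'
        else:
            status[word] = 'text is not deleted'
    return status


class FakeDriver:
    """Stand-in for a real WebDriver so the sketch runs without a browser."""

    def __init__(self, deleted_words):
        self.deleted_words = set(deleted_words)

    def find_elements(self, by, value):
        # Return one dummy "element" per deleted word whose XPath matches.
        return [w for w in self.deleted_words if value == f'//del[text()="{w}"]']


result = deleted_status(FakeDriver(["some text"]), ["some text", "other text"])
print(result)
```

With a real browser you would pass the Selenium driver instead of FakeDriver, after driver.get(...) has loaded the page. If the extra words are only added to the page once "+ More" is clicked (an assumption about your page), you would need to click that button first.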

Upvotes: 1
