moglido

Reputation: 148

Invalid Selector Error: Webscraping different kinds of text from multiple spans using xpath and Selenium

I am trying to scrape out a list of comma-separated authors with an asterisk in the following format [important]:

First Last, First Last, First Last*, First Last

The html section I am scraping is super complicated, but I've successfully tested an xpath that results in text and symbols that I want.

//span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text()

However, when I use that formula in my python code, I get an error.

My code:

# get authors
xpath = "//span[@class=\"hlFld-ContribAuthor\"]/span[@class=\"hlFld-ContribAuthor\"]/a/text() | //span[@class=\"NLM_x\"]/x/text() | //a[@class=\"ref\"]/sup/text()"
authors = driver.find_element_by_xpath(xpath)
print str(authors)

Error:

InvalidSelectorException: Message: The given selector //span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text() is either invalid or does not result in a WebElement. The following error occurred: InvalidSelectorError: The result of the xpath expression "//span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text()" is: [object Text]. It should be an element.

How do I get selenium to grab the right text and symbols that I need in the right order? I haven't been able to print the results of my xpath without new lines.

EDIT: solved the xpath error by removing /text() from xpaths

Upvotes: 1

Views: 310

Answers (1)

gtlambert

Reputation: 11971

The function driver.find_element_by_xpath(my_xpath) expects the node identified by my_xpath to be a DOM element. If it isn't, it throws an error. Each of your XPath expressions ends in /text() and therefore returns text nodes, which is what causes the error.

To return the DOM elements instead, alter your XPath expression to:

"//span[@class=\"hlFld-ContribAuthor\"]/span[@class=\"hlFld-ContribAuthor\"]/a | //span[@class=\"NLM_x\"]/x | //a[@class=\"ref\"]/sup"

Also, since the expression matches multiple elements, you should use driver.find_elements_by_xpath (note the plural) instead of driver.find_element_by_xpath.

You will then be able to grab the desired text from each author element by looping over authors:

for author in authors:
    print(author.text)
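
If you want everything on one line rather than one fragment per line, a minimal sketch along these lines might help. It assumes the matched elements come back in document order and that the separator elements contain commas; join_author_fragments and authors_line are hypothetical helper names, not part of Selenium.

```python
import re


def join_author_fragments(fragments):
    """Join scraped text fragments into a single comma-separated line."""
    # Collapse newlines and repeated whitespace inside each fragment.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    # Concatenate in the order received, skipping empty fragments.
    joined = "".join(fragment for fragment in cleaned if fragment)
    # Normalize every separator to a comma followed by one space.
    return re.sub(r"\s*,\s*", ", ", joined).strip()


def authors_line(elements):
    """Build the author line from objects exposing a .text attribute."""
    return join_author_fragments(element.text for element in elements)
```

With the authors list from above, print(authors_line(authors)) should then give something in the First Last, First Last*, First Last shape, provided the page's separators really are commas.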

Upvotes: 1
