ZheniaMagic

Reputation: 119

How to access a text element in Selenium if it is split by b (bold) tags

I am having trouble extracting some values from a website while scraping it. The text I want is inside an element whose class contains several pieces of text separated by b tags (the b tags themselves also contain text that matters to me).

So first I tried to locate the b tag with the label I need ('Category' in this case) and then extract the actual category from the text that follows that tag. I cannot use an exact XPath, because the other pages I need to scrape have a different number of rows in this sidebar, so the positions, and therefore the XPaths, differ.

The expected output is 'utility', the category shown in the sidebar.

The website and the text I need to extract look like this (see the sidebar on the right containing 'Category'):

[image]

The element looks like this:

[image]

And the code I tried:

from selenium import webdriver

driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()

The link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.

Thank you!

Upvotes: 0

Views: 418

Answers (2)

spudWorks2020

Reputation: 11

There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
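As a rough sketch of what that could look like in Python: the code below builds a standard MediaWiki action=parse request and pulls the raw wikitext out of the JSON response. The endpoint path in API_URL is an assumption (many wikis serve the API from /w/api.php instead), the helper names are made up for illustration, and formatversion=2 assumes a reasonably recent MediaWiki. How the category is stored inside the wikitext is not something I know, so that part is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API lives at /api.php on this site; adjust if it is
# served from /w/api.php or elsewhere.
API_URL = "https://www.statsforsharks.com/api.php"


def build_parse_url(page_title, api_url=API_URL):
    """Build an action=parse request that asks for the page's raw wikitext."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # returns wikitext as a plain string
    }
    return api_url + "?" + urlencode(params)


def extract_wikitext(response_body):
    """Pull the wikitext string out of a formatversion=2 parse response."""
    data = json.loads(response_body)
    return data["parse"]["wikitext"]


def fetch_wikitext(page_title):
    """Download the page's wikitext (this performs a network request)."""
    with urlopen(build_parse_url(page_title)) as response:
        return extract_wikitext(response.read().decode("utf-8"))
```

Calling fetch_wikitext("MC_Squares") would then give you the page source to search through, without starting a browser at all.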

Any particular reason you want to scrape my website?

Upvotes: 1

Prab G

Reputation: 356

You might be better off using regex here, as all of the text sits under the 'company-sidebar-body' class, where only some of it is wrapped in b tags and the rest is not.

So, you can get the text of that class first:

sidebartext = driver.find_element_by_class_name("company-sidebar-body").text

That will give you the following:

"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"

You can then use regex to target the category:

import re

c = re.search(r"Category:\s\w+", sidebartext).group()

print(c)

c will be 'Category: Utility', which you can then work with. This also works when the category value ('Utility') is different on other pages, as long as it is a single word, since \w+ stops at the first space.
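If you only want the value rather than the whole 'Category: Utility' string, a capture group saves the extra splitting. Here is a minimal sketch that works on the sidebar text shown above; get_field is a hypothetical helper name, and it assumes each label and its value sit on one line separated by a colon, as in that sample.

```python
import re

# Sample copied from the sidebar text shown above.
sidebartext = (
    "EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
    "Category: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\n"
    "Value: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\n"
    "Equity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
)


def get_field(text, label):
    """Return the text after the first 'label:' up to the end of that line."""
    # [^\r\n]+ keeps multi-word values such as "Kevin O'Leary" intact.
    pattern = re.escape(label) + r":[ \t]*([^\r\n]+)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


category = get_field(sidebartext, "Category")
print(category.lower())
```

Returning None for a missing label avoids the AttributeError you would get from calling .group() on a failed search, which matters if some pages have no 'Category' row. Labels that appear twice in the sidebar (Equity, Amount, Value) only give you the first occurrence with this approach.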

Upvotes: 1
