SAMARTH BHARTI

Reputation: 7

Selenium (Python): Remove all webelements from text

I am trying to scrape Twitter, and for now I am only concerned with the tweet text. When I narrow my search down to the 'p' tag that contains the text, there are unexpected tags like 'strong' inside the text that I just can't get rid of.

For example, this is the output when I print my tag text:

> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.02910224956545071-1")>
> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.02910224956545071-2")> Trevor Noah challenging Tomi Lahren
> and her stance on Black Lives Matter, her racist narratives, Donald
> Trump and more

While the output I expect is as follows:

> Trevor Noah challenging Tomi Lahren and her stance on Black Lives
> Matter, her racist narratives, Donald Trump and more

Another example is as follows:

> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.18626949664745118-10")> If the Cubs can win the World
> Series, Donald Trump can win the presidency, and the Cowboys can win
> 11-straight, then I can survive finals

Here is what I expect:

> If the Cubs can win the World
> Series, Donald Trump can win the presidency, and the Cowboys can win
> 11-straight, then I can survive finals

The number of occurrences and the position of this webelement are different for every iteration, and hence I'm stuck. I have tried regex, but couldn't solve the issue. Any help would be appreciated. Thank you!

Upvotes: 0

Views: 1296

Answers (2)

SAMARTH BHARTI

Reputation: 7

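This is how I did it using BeautifulSoup.

from bs4 import BeautifulSoup

# .text gives the rendered text of the <p> element; BeautifulSoup then strips any leftover markup
tweet_text = tweet.find_element_by_class_name("js-tweet-text-container").find_element_by_tag_name("p").text
soup = BeautifulSoup(tweet_text, "html.parser")
text = soup.get_text()
print(text)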
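
For anyone who wants to see the tag-stripping step in isolation, here is a minimal sketch that uses the standard library's `html.parser` in place of BeautifulSoup, so it runs without Selenium or any third-party package. The sample markup and the helper name `strip_tags` are assumptions made up for illustration; they are not Twitter's actual HTML or part of any library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


# Made-up fragment resembling a tweet paragraph with nested tags.
sample = (
    '<p>If the <strong>Cubs</strong> can win the '
    '<a href="#">World Series</a>, I can survive finals</p>'
)
print(strip_tags(sample))
```

The same idea applies to the inner HTML of the 'p' element: nested tags such as 'strong' are dropped and only their text is kept.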

Upvotes: 0

emporerblk

Reputation: 1066

Always read the docs first!

>>> tweet_element = tweet.find_element_by_class_name("js-tweet-text-container").find_element_by_tag_name("p")
>>> tweet_element.text
"If the Cubs can win the World Series, Donald Trump can win the presidency, and the Cowboys can win 11-straight, then I can survive finals"

Selenium is not recommended for scraping. Please, if you can, switch to either the official Twitter API, Tweepy (a Python library for the Twitter API), or even Requests and BeautifulSoup.
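
If Selenium has to stay and the strings have already been built with WebElement reprs embedded in them, as in the question's output, a regular expression can remove them after the fact. This is only a sketch: it assumes the repr always has the `<selenium.webdriver.remote.webelement.WebElement (session="...", element="...")>` shape shown in the question, and the names `WEBELEMENT_REPR` and `remove_webelement_reprs` are made up for illustration.

```python
import re

# Matches one WebElement repr plus any whitespace that follows it.
# \s* tolerates the line breaks seen inside the repr in the question.
WEBELEMENT_REPR = re.compile(
    r'<selenium\.webdriver\.remote\.webelement\.WebElement\s*'
    r'\(session="[^"]*",\s*element="[^"]*"\)>\s*'
)


def remove_webelement_reprs(text):
    """Strip every embedded WebElement repr from text (hypothetical helper)."""
    return WEBELEMENT_REPR.sub("", text).strip()


# Sample string copied from the shape of the output in the question.
raw = (
    '<selenium.webdriver.remote.webelement.WebElement '
    '(session="5dd609e4b0694f9c363007d68d5b698a", '
    'element="0.18626949664745118-10")> If the Cubs can win the World Series'
)
print(remove_webelement_reprs(raw))
```

Because the pattern is applied globally, it does not matter how many reprs appear or where they sit in the string. Reading `.text` from the element directly avoids the problem in the first place, so this is best treated as a fallback.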

Upvotes: 1
