Reputation: 7
I am trying to scrape Twitter, and I am concerned only with the tweet text as of now. When I narrow my search down to the 'p' tag that contains the text, there are unexpected tags like 'strong' inside the text that I just can't get rid of.
For example, this is what is output when I print my tag text:
> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.02910224956545071-1")>
> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.02910224956545071-2")> Trevor Noah challenging Tomi Lahren
> and her stance on Black Lives Matter, her racist narratives, Donald
> Trump and more
While the output I expect is as follows:
> Trevor Noah challenging Tomi Lahren and her stance on Black Lives
> Matter, her racist narratives, Donald Trump and more
Another example is as follows:
> <selenium.webdriver.remote.webelement.WebElement
> (session="5dd609e4b0694f9c363007d68d5b698a",
> element="0.18626949664745118-10")> If the Cubs can win the World
> Series, Donald Trump can win the presidency, and the Cowboys can win
> 11-straight, then I can survive finals
Here is what I expect:
> If the Cubs can win the World
> Series, Donald Trump can win the presidency, and the Cowboys can win
> 11-straight, then I can survive finals
The number of occurrences and the position of this WebElement differ on every iteration, and hence I'm stuck. I have tried regex, but couldn't solve the issue. Any help would be appreciated. Thank you!
Upvotes: 0
Views: 1296
Reputation: 7
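This is how I did it using BeautifulSoup.
from bs4 import BeautifulSoup
# Visible text of the tweet's <p> element (note: the name `id` shadows the built-in)
id = tweet.find_element_by_class_name("js-tweet-text-container").find_element_by_tag_name("p").text
soup = BeautifulSoup(id, "html.parser")  # name the parser explicitly to avoid a warning
text = soup.get_text()
print(text)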
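The idea behind `get_text()` is to parse the markup and keep only the text nodes, so nested tags such as 'strong' disappear. As a rough illustration of that idea, here is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The `TextOnly` class, the `strip_tags` helper, and the sample markup are made up for this example and are not taken from Twitter's real page. With Selenium, markup like this would probably have to come from something like `get_attribute("innerHTML")` rather than `.text`, but that is an assumption about the setup.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, whitespace-normalized."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


# Made-up markup resembling a tweet paragraph with nested tags.
sample = (
    '<p class="tweet-text">Trevor Noah challenging <strong>Tomi Lahren</strong> '
    'and her stance on <a href="#"><s>#</s><b>BlackLivesMatter</b></a></p>'
)
print(strip_tags(sample))
```

Only the text between the tags survives, so the number and position of the nested tags no longer matter.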
Upvotes: 0
Reputation: 1066
>>> tweet_element = tweet.find_element_by_class_name("js-tweet-text-container").find_element_by_tag_name("p")
>>> tweet_element.text
"If the Cubs can win the World Series, Donald Trump can win the presidency, and the Cowboys can win 11-straight, then I can survive finals"
>>> # Or, to strip the WebElement reprs from the string you already have:
>>> import re
>>> print(re.sub(r'.*>', '', str(id)))
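A caveat on the regex route: `.*>` is greedy and removes everything up to the last `>` on a line, so it would also eat tweet text that happens to contain a `>`. A narrower pattern that matches only the WebElement repr shown in the question may be safer. The sketch below assumes the repr always has the `session="…", element="…"` shape from the question's output; the names `WEBELEMENT_REPR` and `drop_webelement_reprs` are made up for illustration.

```python
import re

# Matches one WebElement repr in the form printed in the question
# (assumed shape: <...WebElement (session="...", element="...")>).
WEBELEMENT_REPR = re.compile(
    r'<selenium\.webdriver\.remote\.webelement\.WebElement\s*'
    r'\(session="[^"]*",\s*element="[^"]*"\)>\s*'
)


def drop_webelement_reprs(raw):
    """Remove every WebElement repr from raw and trim the result."""
    return WEBELEMENT_REPR.sub("", raw).strip()


# String assembled from the first example in the question (text shortened).
raw = (
    '<selenium.webdriver.remote.webelement.WebElement '
    '(session="5dd609e4b0694f9c363007d68d5b698a", '
    'element="0.02910224956545071-1")> '
    '<selenium.webdriver.remote.webelement.WebElement '
    '(session="5dd609e4b0694f9c363007d68d5b698a", '
    'element="0.02910224956545071-2")> '
    'Trevor Noah challenging Tomi Lahren'
)
print(drop_webelement_reprs(raw))
```

Because the pattern is anchored on the repr itself, it works regardless of how many reprs appear or where they sit in the string, and it leaves any other `>` characters alone.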
Selenium is not recommended for scraping. Please, if you can, switch to either the official Twitter API, Tweepy (a Python library for the Twitter API), or even Requests and BeautifulSoup.
Upvotes: 1