Reputation: 23
I am trying to extract all the text from the tweets before the URL starting with "https:...".
Example Tweet:
"This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
In this example I would like to remove the "https://... (Video via @QuickTake)" part and keep the text from the beginning. It should also work when the tweet text contains no URL at all.
I have tried this expression, but it gives two matches when the tweet contains a URL:
/(.*)(?=\shttps.*)|(.*)
How can I make it retrieve only the text from the tweets?
Thanks in advance!
Upvotes: 2
Views: 103
Reputation: 626690
You may remove the https and all that follows until the end of the string. Use
tweet = re.sub(r'\s*https.*', '', tweet)
Details:
\s* - 0+ whitespaces
https - a literal string
.* - the rest of the string (line).
Upvotes: 1
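As a rough sketch of how the substitution above behaves on both kinds of tweet, here it is wrapped in a small function. The name strip_url_tail and the shortened sample tweet are only illustrative and do not come from the question; the sketch assumes the link always starts with https.

```python
import re

def strip_url_tail(tweet):
    # Hypothetical helper: drop any whitespace before "https",
    # the literal "https", and everything after it on that line.
    return re.sub(r'\s*https.*', '', tweet)

with_url = "Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
without_url = "this is some tweet without a URL"

print(strip_url_tail(with_url))     # text before the link
print(strip_url_tail(without_url))  # unchanged, since nothing matches
```

Because . does not match a newline by default, a tweet with line breaks would keep the text on the lines after the link; passing flags=re.DOTALL to re.sub is one way to remove that as well, if that is what you want.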
Reputation: 12515
This might be an oversimplification, but a simple str.find might do the trick:
>>> s = "This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
>>> s[:s.find('https://')]
'This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness '
You basically just slice the tweet up to the point at which you find the first instance of https://.
Note that this approach alone won't work when https:// does not appear in a tweet. When https:// isn't found, s.find('https://') returns -1, so the slice s[:-1] silently drops the last character of the tweet. If it's not found, just set the index (link_index below) to the length of the full tweet:
>>> s = 'this is some tweet without a URL'
>>> link_index = s.find('https://')
>>> if link_index == -1:
...     link_index = len(s)
...
>>> s[:link_index]
'this is some tweet without a URL'
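The two cases above could be folded into one function, roughly like this. The name text_before_link and the marker parameter are made up for illustration, and the trailing rstrip() is an addition that is not part of the session above; it only removes the space left in front of the link.

```python
def text_before_link(tweet, marker='https://'):
    # Hypothetical helper combining both cases from the session above.
    link_index = tweet.find(marker)
    if link_index == -1:
        # No link found: keep the whole tweet.
        link_index = len(tweet)
    # rstrip() is an extra step (an assumption about the desired output)
    # that drops the whitespace left just before the link.
    return tweet[:link_index].rstrip()

print(text_before_link("raise awareness https://... (Video via @QuickTake)"))
print(text_before_link("this is some tweet without a URL"))
```

s.partition('https://')[0] should give the same prefix without the explicit -1 check, since partition returns the whole string as its first element when the separator is missing.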
Upvotes: 0