Zorni97

Reputation: 23

How to choose the first match from an alternation regex?

I am trying to extract all the text in a tweet that comes before the URL starting with "https:...".

Example Tweet:

"This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"

In this example I would like to remove "https://... (Video via @QuickTake)" and keep the text from the beginning. It should also work when the tweet text contains no URL.

I have tried this expression, but it gives two matches when the tweet contains a URL:

/(.*)(?=\shttps.*)|(.*)

How can I make it retrieve only the text of the tweet?

Thanks in advance!

Upvotes: 2

Views: 103

Answers (2)

Wiktor Stribiżew

Reputation: 626690

You can remove https and everything that follows it up to the end of the string with:

tweet = re.sub(r'\s*https.*', '', tweet)

Details:

  • \s* - zero or more whitespace characters
  • https - the literal string https
  • .* - zero or more characters other than line breaks, i.e. the rest of the line.
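As a minimal sketch of how this behaves on both kinds of input, the substitution can be wrapped in a small helper. The function name strip_url_tail and the shortened sample strings are illustrative assumptions, not part of the question:

```python
import re

def strip_url_tail(tweet: str) -> str:
    # Hypothetical helper: drop optional whitespace, the literal "https",
    # and everything after it on the same line.
    return re.sub(r'\s*https.*', '', tweet)

with_url = "Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
without_url = "this is some tweet without a URL"

print(strip_url_tail(with_url))     # text before the link
print(strip_url_tail(without_url))  # unchanged, since nothing matches
```

Because . does not match line breaks by default, a tweet that spans several lines would only lose the remainder of the line containing https. Passing flags=re.DOTALL to re.sub would remove everything up to the end of the string instead.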

Upvotes: 1

boot-scootin

Reputation: 12515

This might be an oversimplification, but a simple str.find might do the trick:

>>> s = "This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
>>> s[:s.find('https://')]
'This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness '

You basically just slice the tweet up to the point where the first instance of https:// is found.

Note that this approach alone won't work when https:// does not appear in a tweet. When https:// isn't found, s.find('https://') returns -1, so s[:-1] silently drops the last character of the tweet. In that case, set the index (link_index below) to the length of the full tweet:

>>> s = 'this is some tweet without a URL'
>>> link_index = s.find('https://')
>>> if link_index == -1:
...     link_index = len(s)
... 
>>> s[:link_index]
'this is some tweet without a URL'
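The two cases can be folded into one function, roughly as sketched below. The name text_before_link, the marker parameter, and the trailing rstrip() are assumptions added for illustration; the rstrip() only removes the space left in front of the link:

```python
def text_before_link(tweet: str, marker: str = "https://") -> str:
    # Hypothetical helper combining both cases from the session above.
    link_index = tweet.find(marker)
    if link_index == -1:
        # No link present: keep the whole tweet.
        link_index = len(tweet)
    # rstrip() is an addition here; it trims the whitespace before the link.
    return tweet[:link_index].rstrip()

print(text_before_link("raise awareness https://... (Video via @QuickTake)"))
print(text_before_link("this is some tweet without a URL"))
```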

Upvotes: 0
