Razif Azmal

Reputation: 23

Python regex removes the URL, but when the URL is at the beginning of a tweet the whole sentence is removed as well

I need to remove every URL from a set of tweet reviews. How can I remove only the URL when it appears at the beginning of a tweet?

I've tried some code, and the Python regex below successfully removes the URL, but if the URL is at the beginning of a tweet, the rest of the sentence is removed as well.

re.sub(r'https?:\/\/.*[\r\n]*\S+', '', verbatim, flags = re.MULTILINE)


Upvotes: 2

Views: 185

Answers (2)

The fourth bird

Reputation: 163217

The pattern https?:\/\/.*[\r\n]*\S+ first matches http, an optional s, and then ://

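Then the .* part matches greedily until the end of the line (the dot does not match a newline by default), [\r\n]* matches 0+ newline characters, and \S+ matches 1+ non-whitespace chars.

So the URL is matched, followed by the rest of the line, any newlines, and 1+ non-whitespace chars at the start of the next line as well. When there is no next line, .* backtracks by one char so that \S+ can still match, which means the rest of the tweet after the URL is removed anyway.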
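To see the over-matching concretely, here is a small sketch that runs the pattern from the question against a few sample strings. The sample tweets and the helper name strip_greedy are made up for illustration; only the pattern itself comes from the question.

```python
import re

# The pattern from the question, unchanged.
GREEDY = r'https?:\/\/.*[\r\n]*\S+'

# Hypothetical sample tweets, invented for this illustration.
url_first = "https://example.com great phone, love it"
url_last = "great phone, love it https://example.com"
two_lines = "see https://example.com now\nsecond line here"


def strip_greedy(text):
    """Apply the question's substitution to a single string."""
    return re.sub(GREEDY, '', text, flags=re.MULTILINE)


# URL first: .* swallows the rest of the line, so nothing is left.
print(repr(strip_greedy(url_first)))   # ''

# URL last: only the URL follows the match start, so the text before it survives.
print(repr(strip_greedy(url_last)))    # 'great phone, love it '

# Two lines: the match runs across the newline and eats "second" as well.
print(repr(strip_greedy(two_lines)))   # 'see  line here'
```

Note that re.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern.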

You could shorten the pattern so that it matches only the URL itself, up to the next whitespace char:

\bhttps?://\S+
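A minimal sketch of how that shorter pattern might be used with re.sub. The function name remove_urls and the sample tweets are hypothetical, and the whitespace cleanup is an extra step that assumes line breaks inside a tweet do not need to be preserved.

```python
import re

# The shortened pattern: a word boundary, the scheme, then everything up to
# the next whitespace character.
URL = re.compile(r'\bhttps?://\S+')


def remove_urls(text):
    """Remove URLs, then collapse the whitespace left behind.

    Note: split()/join() also turns newlines into single spaces; drop that
    step if line breaks matter.
    """
    without_urls = URL.sub('', text)
    return ' '.join(without_urls.split())


print(remove_urls("https://example.com great phone, love it"))
print(remove_urls("great phone https://t.co/abc123 love it"))
print(remove_urls("love it http://example.com"))
```

Because \S+ stops at the first whitespace char, a URL at the beginning of a tweet is removed without touching the words that follow it.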


Upvotes: 2

davidgamero

Reputation: 466

Try making your regex lazy by adding ? after the .* and matching up to the final whitespace character.

I also escaped the forward slashes.

re.sub(r'https?:\/\/.*?[\r\n]*[\s?]', '', verbatim, flags = re.MULTILINE)


Upvotes: 0
