Reputation: 23
I need to remove every URL from a set of tweet reviews. How can I remove only the URL when it appears at the beginning of a tweet?
I've tried some code, and this Python regex successfully removes the URL, but if the URL is at the beginning of the tweet, the rest of the sentence is removed as well.
re.sub(r'https?:\/\/.*[\r\n]*\S+', '', verbatim, flags = re.MULTILINE)
Upvotes: 2
Views: 185
Reputation: 163217
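The pattern https?:\/\/.*[\r\n]*\S+ first matches http, an optional s, and ://.
The .* part then matches greedily to the end of the line, [\r\n]* matches zero or more newline characters, and \S+ matches one or more non-whitespace characters.
So the match covers the URL, the rest of the line, any newlines, and the first run of non-whitespace characters on the next line. When there is no next line, .* backtracks by one character so that \S+ can still match, which means everything after the URL is removed.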
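To see the over-match in practice, here is a minimal sketch that runs the pattern from the question on two made-up tweets. The sample strings and variable names are invented for illustration and do not come from the question.

```python
import re

# Pattern copied from the question.
original_pattern = r'https?:\/\/.*[\r\n]*\S+'

# Hypothetical single-line tweet that starts with a URL.
single_line = "https://t.co/abc123 great phone, battery lasts all day"
single_result = re.sub(original_pattern, '', single_line, flags=re.MULTILINE)
print(repr(single_result))  # the whole tweet is gone

# Hypothetical two-line tweet: the match runs past the newline
# and also takes the first word of the second line.
two_lines = "see https://t.co/abc123\nnext line here"
two_line_result = re.sub(original_pattern, '', two_lines, flags=re.MULTILINE)
print(repr(two_line_result))
```

In the first case nothing is left of the tweet. In the second case the word "next" disappears together with the URL and the newline.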
You could shorten the pattern to:
\bhttps?://\S+
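A short sketch of how the shortened pattern might be used. The helper name strip_urls and the sample tweets are assumptions made for this example, and the whitespace clean-up at the end is an extra step that is not part of the pattern itself.

```python
import re

# Shortened pattern: a word boundary, http or https, ://,
# then everything up to the next whitespace character.
URL_PATTERN = re.compile(r'\bhttps?://\S+')

def strip_urls(text):
    """Remove URLs and leave the rest of the text in place (hypothetical helper)."""
    without_urls = URL_PATTERN.sub('', text)
    # Extra step, not part of the pattern: collapse the whitespace
    # left behind where a URL used to be.
    return ' '.join(without_urls.split())

# Made-up tweets with the URL at the start, in the middle, and at the end.
print(strip_urls("https://t.co/abc123 great phone, battery lasts all day"))
print(strip_urls("great phone http://example.com/x?y=1 battery lasts all day"))
print(strip_urls("great phone, battery lasts all day https://t.co/abc123"))
```

Because \S+ stops at the first whitespace character, only the URL is removed wherever it appears. Note that the final join also turns newlines into single spaces, so drop that step if line breaks matter.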
Upvotes: 2
Reputation: 466
Try making your regex lazy by adding ? after the .* and matching up to the whitespace character that ends the URL.
I also escaped the forward slashes.
re.sub(r'https?:\/\/.*?[\r\n]*[\s?]', '', verbatim, flags=re.MULTILINE)
Upvotes: 0