Reputation: 4356
I would like to remove URLs from my text:
import re

# Django URL validator: https://github.com/django/django/blob/master/django/core/validators.py
regex = re.compile(
r'^(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' # domain...
r'localhost|' # localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' # ...or ipv4
r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' # ...or ipv6
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
text = "http://test.com word1 word2 https://test.de word3"
text = re.sub(regex, '', text)
print(text)
The output is still:
http://test.com word1 word2 https://test.de word3
What's wrong with my code?
Upvotes: 0
Views: 1593
Reputation: 13251
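Your regex is anchored to the beginning and end of the string with the ^ and $ characters, so it can only match when the entire string is a single URL. Just remove them so the pattern can match anywhere in the text: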
regex = re.compile(
r'(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' # domain...
r'localhost|' # localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|' # ...or ipv4
r'\[?[A-F0-9]*:[A-F0-9:]+\]?)' # ...or ipv6
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)', re.IGNORECASE)
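As a rough, self-contained sketch of how the un-anchored pattern might be used: the names `URL_RE` and `strip_urls` are hypothetical, and two details go beyond the answer above and are assumptions of this sketch. First, the last alternation is reordered to `[/?]\S+|/?`, because without the `$` anchor the `/?` branch would otherwise succeed first and leave any path behind. Second, the leftover whitespace is collapsed after the substitution.

```python
import re

# Un-anchored variant of the Django-style URL pattern (no ^ and $).
URL_RE = re.compile(
    r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    # Assumption of this sketch: try the longer path/query branch first,
    # so a trailing path is removed together with the host.
    r'(?:[/?]\S+|/?)',
    re.IGNORECASE)


def strip_urls(text):
    """Remove URLs from text and collapse the whitespace left behind."""
    without_urls = URL_RE.sub('', text)
    return ' '.join(without_urls.split())


text = "http://test.com word1 word2 https://test.de word3"
print(strip_urls(text))  # word1 word2 word3
```

If the original spacing matters, the whitespace-collapsing step can be dropped and `URL_RE.sub('', text)` used directly.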
Upvotes: 2