Reputation: 943
I am trying to write a regex in Python that matches all non-word characters (spaces, slashes, colons, etc.) except those that are part of a URL. I know I can match all non-word characters with \W+, and I also have a regex that matches URLs:
https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+
but I can't figure out how to combine them. What would be the best way to get what I need here?
EDIT
To clarify, I am trying to split on this regex. When I call re.split() with the following regex:
https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)
I end up with something like the following:
INPUT: this is a test: https://www.google.com
OUTPUT: ['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']
What I'm hoping to get is this: ['this', 'is', 'a', 'test', 'https://www.google.com']
This is how I'm splitting:
import re
message = 'this is a test: https://www.google.com'
re.split(r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)
Upvotes: 2
Views: 124
Reputation: 626802
You should reverse the logic: instead of splitting on the separators, match either a URL pattern or a run of one or more word characters:
import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
See the Python demo.
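For context, the None and empty-string entries in the question's output come from how re.split() treats capturing groups: the text of every group is inserted into the result, and a group that did not take part in a match is reported as None. The URL alternative has no group, so a matched URL is consumed as a separator and lost. The following is only a sketch of that behavior, using a deliberately simplified URL pattern (an assumption, not the pattern from the question), followed by one possible split-based workaround that captures the URL instead of the separator:

```python
import re

message = 'this is a test: https://www.google.com'

# Simplified stand-in for the question's pattern (assumption):
# an ungrouped URL alternative and a captured non-word character.
question_like = r"https?://[\w./-]+|(\W)"
parts = re.split(question_like, message)
print(parts)
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']

# Workaround: capture the URL, leave \W uncaptured, then drop the
# None and empty-string entries that re.split() produces.
url_captured = r"(https?://[\w./-]+)|\W"
tokens = [p for p in re.split(url_captured, message) if p]
print(tokens)
# ['this', 'is', 'a', 'test', 'https://www.google.com']
```

The re.findall() approach avoids the filtering step entirely, which is why it is usually the simpler choice here.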
Note that I shortened your URL pattern. You had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*, where [a-zA-Z]* is redundant: it matches zero or more letters, and the [\w\/\-]+ pattern that follows already matches one or more word characters (which include letters), / or - characters. You also do not need to escape dots inside character classes, a hyphen at the end of a character class, or slashes anywhere in a Python pattern; the unnecessary escapes are removed here.
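As a quick sanity check of the simplification, the sketch below compares the question's escaped URL alternatives (with \w+ appended so that plain words are matched too) against the shortened pattern. The sample strings are made up for illustration, and agreement on them is not a proof that the two patterns are equivalent on every input.

```python
import re

# URL alternatives as written in the question, plus "\w+" for plain words.
verbose = (r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+"
           r"|https*:\/\/[\w\.]+\.[a-zA-Z]*"
           r"|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+"
           r"|\w+")

# Shortened pattern from the answer.
short = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"

# Hypothetical sample inputs, chosen only for illustration.
samples = [
    'this is a test: https://www.google.com',
    'see example.com/docs/page, or http://test.org/a-b now',
]

for text in samples:
    # Both patterns should produce the same tokens on these inputs.
    assert re.findall(verbose, text) == re.findall(short, text)
    print(re.findall(short, text))
# ['this', 'is', 'a', 'test', 'https://www.google.com']
# ['see', 'example.com/docs/page', 'or', 'http://test.org/a-b', 'now']
```

One further tweak to consider: https* also accepts strings such as httpsss://, so https? is the usual way to make the single s optional.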
Upvotes: 1