Get all non-word characters excluding those in a URL

Question

I am attempting to write a regex in Python that will match all non-word characters (spaces, slashes, colons, etc.) excluding those that occur inside a URL. I know I can get all non-word characters with \W+, and I also have a regex to match URLs: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+ but I can't figure out a way to combine them. What would be the best way to get what I need here?

EDIT

To clarify, I am trying to split on this regex. When I use re.split() with the following regex: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W) I end up with something like the following:

INPUT: this is a test: https://www.google.com

OUTPUT: ['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']

What I'm hoping to get is this: ['this', 'is', 'a', 'test', 'https://www.google.com']

This is how I'm splitting:

import re

message = 'this is a test: https://www.google.com'
re.split(r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)

Wiktor Stribiżew · Accepted Answer

You should reverse the logic: instead of splitting on the non-word characters, match either a URL pattern or a run of one or more word chars with re.findall:

import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
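
If the pattern is needed in more than one place, it can be compiled once and wrapped in a small helper. This is only a sketch: the names TOKEN_RX and tokenize are made up for illustration, and the second input is an extra sample chosen to exercise the scheme-less alternative ([\w.]+\.[a-zA-Z]*/[\w/-]+).

```python
import re

# Same pattern as in the answer, compiled once for reuse.
# TOKEN_RX and tokenize are hypothetical names, not part of the original answer.
TOKEN_RX = re.compile(r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+")

def tokenize(message):
    """Return URLs and runs of word chars, in order of appearance."""
    return TOKEN_RX.findall(message)

print(tokenize("this is a test: https://www.google.com"))
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
print(tokenize("docs at example.com/a/b-c, ok?"))
# => ['docs', 'at', 'example.com/a/b-c', 'ok']
```

Note that the character classes do not include ? or =, so a query string is not kept as part of the URL: it ends the URL match and its word chars come out as separate tokens.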

Note that I shortened your URL pattern. You had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*. The [a-zA-Z]* part is redundant in the first one: it matches zero or more letters, and the [\w\/\-]+ that follows already matches one or more word chars, / or - chars, so the two alternatives collapse into https*://[\w.]+\.[\w/-]*. You also do not have to escape slashes, dots inside character classes, or a hyphen at the end of a character class, so the unnecessary escapes are removed here.
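
As a quick sanity check of the shortening, the two original alternatives and the merged one can be compared with re.fullmatch on a few strings. This is a spot check on hand-picked samples, not a proof of equivalence; the sample strings and the names original, shortened and verdicts are assumptions made for illustration.

```python
import re

# The asker's two scheme-prefixed alternatives, escapes left as written.
original = re.compile(
    r"https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+"
    r"|https*:\/\/[\w\.]+\.[a-zA-Z]*"
)
# The merged, unescaped form used in the answer.
shortened = re.compile(r"https*://[\w.]+\.[\w/-]*")

def verdicts(candidate):
    """Return (original accepts?, shortened accepts?) for a whole string."""
    return (
        original.fullmatch(candidate) is not None,
        shortened.fullmatch(candidate) is not None,
    )

samples = [
    "https://www.google.com",    # host only
    "http://example.org/a/b-c",  # host plus path
    "https://example.",          # nothing after the last dot
    "https://example",           # no dot at all
]
for s in samples:
    old, new = verdicts(s)
    print(s, old, new)
```

On these samples both patterns give the same verdict for each string, which is what the simplification relies on.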
