Reputation: 691
I am parsing text by splitting on non-alphanumeric characters, and I would like to exclude specific characters such as apostrophes, dashes/hyphens and commas from the split.
I would like to build a regex for the following cases:
This is what I have tried:
import re

def split_text(text):
    # \W splits on every non-word character, including ' - and ,
    my_text = re.split(r'\W', text)
    # the following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)
    return my_text
Any ideas?
Upvotes: 3
Views: 2006
Reputation: 24812
Is this what you want?
Split on non-alphanumeric characters, excluding apostrophes and hyphens:
my_text = re.split(r"[^\w'-]+",text)
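Split on non-alphanumeric characters, excluding commas, apostrophes and hyphens (the hyphen goes last in the class so that it is read as a literal and not as a range):
my_text = re.split(r"[^\w',-]+",text)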
The [] syntax defines a character class; [^..] "complements" it, i.e. it negates it.
See the documentation about that:
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
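The quoted behaviour is easy to check interactively. The snippet below is only a sketch with made-up input strings; the last two checks are an addition that is not part of the quoted documentation, illustrating that a hyphen is a literal when it sits at the end of the set but is parsed as a range when it directly follows \w.

```python
import re

# [^5] matches any single character except '5'.
not_five = re.findall(r"[^5]", "1525")

# [^^] matches any single character except '^'.
not_caret = re.findall(r"[^^]", "a^b")

# A caret that is not the first character of the set is an ordinary literal.
literal_caret = re.findall(r"[5^]", "5^6")

# A hyphen placed last in the set is a literal hyphen.
hyphen_last = re.findall(r"[\w'-]+", "well-known isn't")

# A hyphen directly after \w is parsed as a range and fails to compile.
try:
    re.compile(r"[\w-']")
    hyphen_after_class_ok = True
except re.error:
    hyphen_after_class_ok = False

print(not_five, not_caret, literal_caret, hyphen_last, hyphen_after_class_ok)
```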
Upvotes: 3
Reputation: 336158
You can use a negated character class for this:
my_text = re.split(r"[^\w'-]+",text)
or
my_text = re.split(r"[^\w,'-]+",text) # also excludes commas
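Put together, the two patterns could be wrapped roughly like this. It is a minimal sketch, not the asker's final code: the keep_commas flag and the sample sentence are hypothetical, and the filtering of empty strings is an extra step, added because re.split returns '' when the text starts or ends with a separator. Note also that \w matches the underscore, so underscores are never treated as separators here.

```python
import re

def split_text(text, keep_commas=False):
    # keep_commas is a hypothetical flag introduced for this sketch.
    # The hyphen is last in each class so it is treated as a literal.
    pattern = r"[^\w,'-]+" if keep_commas else r"[^\w'-]+"
    # Drop the empty strings produced by leading or trailing separators.
    return [part for part in re.split(pattern, text) if part]

# Made-up sample text, not taken from the question.
sample = "Don't stop; well-known words, like these!"
print(split_text(sample))
print(split_text(sample, keep_commas=True))
```

With keep_commas=True the comma stays attached to the preceding word ("words,"), which may or may not be what is wanted.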
Upvotes: 3