user3247054

Reputation: 691

Regex to ignore specific characters

I am splitting text on non-alphanumeric characters and would like to exclude specific characters such as apostrophes, dashes/hyphens and commas, so that they are not treated as separators.

I would like to build a regex for the following cases:

  1. non-alphanumeric characters, excluding apostrophes and hyphens
  2. non-alphanumeric characters, excluding commas, apostrophes and hyphens

This is what I have tried:

import re

def split_text(text):
    # Splits on every non-word character, including apostrophes, hyphens and commas.
    my_text = re.split(r'\W', text)

    # the following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)

    return my_text

Any ideas?

Upvotes: 3

Views: 2006

Answers (2)

zmo

Reputation: 24812

Is this what you want?

Non-alphanumeric characters, excluding apostrophes and hyphens:

my_text = re.split(r"[^\w'-]+",text)

Non-alphanumeric characters, excluding commas, apostrophes and hyphens:

my_text = re.split(r"[^\w',-]+",text)   # hyphen goes last so it is a literal, not a range (\w-' is rejected as a bad range)

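The [] syntax defines a character class; [^...] complements (negates) it, so it matches any character that is not listed. The trailing + makes a run of consecutive separators count as a single split point.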

See the documentation about that:

Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
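
To make the quoted rule concrete, here is a small sketch of what the negated class actually matches. The sample sentence is made up for illustration; the check at the end assumes Python's re module, which rejects a hyphen placed between \w and another character as an invalid range.

```python
import re

text = "Don't stop, well-known words; end."

# Runs of characters that are NOT word characters, apostrophes or hyphens.
# These runs are exactly the separators re.split would cut on.
separators = re.findall(r"[^\w'-]+", text)
print(separators)  # [' ', ', ', ' ', '; ', '.']

# A hyphen between \w and another character is parsed as a range
# and rejected, so the literal hyphen belongs at the end of the class.
try:
    re.compile(r"[^\w-',]+")
    hyphen_in_middle_ok = True
except re.error:
    hyphen_in_middle_ok = False
print(hyphen_in_middle_ok)  # False
```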

Upvotes: 3

Tim Pietzcker

Reputation: 336158

You can use a negated character class for this:

my_text = re.split(r"[^\w'-]+",text)

or

my_text = re.split(r"[^\w,'-]+",text)   # also keeps commas (does not split on them)
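
As a rough sketch, either pattern could be dropped into the asker's split_text function along these lines. The keep_commas parameter and the filtering of empty strings are assumptions added for illustration, not part of the question or the answers.

```python
import re

def split_text(text, keep_commas=False):
    """Split on runs of characters other than word characters, ' and - (and , if requested)."""
    # keep_commas is a hypothetical switch between the two patterns above.
    pattern = r"[^\w,'-]+" if keep_commas else r"[^\w'-]+"
    # re.split yields '' when the text starts or ends with a separator; drop those.
    return [token for token in re.split(pattern, text) if token]

sample = "Don't stop, well-known words; end."
print(split_text(sample))                    # ["Don't", 'stop', 'well-known', 'words', 'end']
print(split_text(sample, keep_commas=True))  # ["Don't", 'stop,', 'well-known', 'words', 'end']
```

Note that with commas kept, the comma stays attached to the preceding word ('stop,'), since it is no longer a separator.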

Upvotes: 3
