Regex to ignore specific characters

Question

I am splitting a text on non-alphanumeric characters and would like to exclude specific characters such as apostrophes, hyphens and commas, so that they stay inside the words.

I would like to build a regex for the following cases:

non-alphanumeric characters, excluding apostrophes and hyphens
non-alphanumeric characters, excluding commas, apostrophes and hyphens

This is what I have tried:

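import re

def split_text(text):
    # Splits on every single non-word character, so apostrophes and hyphens break words too.
    my_text = re.split(r'\W', text)

    # The following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)

    return my_text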

Case 1:
- Sample Input: What's up? It's good to see you my-friend. "Hello" to-the world!.
- Sample Output: ["What's", 'up', "It's", 'good', 'to', 'see', 'you', 'my-friend', 'Hello', 'to-the', 'world']
Case 2:
- Sample Input: It means that, it's not good-to do such things.
- Sample Output: ['It', 'means', 'that,', "it's", 'not', 'good-to', 'do', 'such', 'things']

Any ideas?

zmo · Accepted Answer

Is this what you want?

Non-alphanumeric characters, excluding apostrophes and hyphens:

my_text = re.split(r"[^\w'-]+",text)

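Non-alphanumeric characters, excluding commas, apostrophes and hyphens:

my_text = re.split(r"[^\w',-]+",text)

Note that the hyphen goes last in the class. Written as [^\w-',], the - sits between \w and ' and is read as a range, which Python rejects with a "bad character range" error.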
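As a minimal sketch, here are both cases run against the sample inputs from the question. The helper name split_words and its keep_commas flag are made up for illustration. The hyphen is placed last in each class so that it is read as a literal character, and empty strings are filtered out because re.split returns one whenever the text starts or ends with a separator.

```python
import re

# Hypothetical helper; the name and the keep_commas flag are illustrative only.
def split_words(text, keep_commas=False):
    # The hyphen is last in the class, so it is a literal '-' and not a range.
    pattern = r"[^\w',-]+" if keep_commas else r"[^\w'-]+"
    # re.split yields an empty string when the text starts or ends with a
    # separator (e.g. the trailing "!."), so drop empty pieces.
    return [word for word in re.split(pattern, text) if word]

case1 = split_words('What\'s up? It\'s good to see you my-friend. "Hello" to-the world!.')
case2 = split_words("It means that, it's not good-to do such things.", keep_commas=True)

print(case1)
# ["What's", 'up', "It's", 'good', 'to', 'see', 'you', 'my-friend', 'Hello', 'to-the', 'world']
print(case2)
# ['It', 'means', 'that,', "it's", 'not', 'good-to', 'do', 'such', 'things']
```

Keep in mind that \w also matches the underscore, so underscores stay inside words with either pattern.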

The [] syntax defines a character class, and [^...] complements it, i.e. it matches any character that is not listed in the class.

See the documentation about that:

Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
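The examples from that passage can be checked directly. This is only a small sketch; the variable names and the test strings are chosen here for illustration.

```python
import re

# [^5] matches any single character except '5'.
not_five = re.findall(r"[^5]", "1525")

# [^^] matches any single character except '^'.
not_caret = re.findall(r"[^^]", "a^b")

# When '^' is not the first character of the set, it is an ordinary literal.
five_or_caret = re.findall(r"[5^]", "5^6")

print(not_five)       # ['1', '2']
print(not_caret)      # ['a', 'b']
print(five_or_caret)  # ['5', '^']
```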
