monstermatt

Reputation: 69

Python re: split on all spaces and punctuation except the apostrophe

I want to split a string on all spaces and punctuation except the apostrophe. Preferably, a single quote should still act as a delimiter unless it is being used as an apostrophe. I also want to keep the delimiters. Example string:
words = """hello my name is 'joe.' what's your's"""

Here is my re pattern so far: splitted = re.split(r"[^'-\w]", words.lower()). I tried putting the single quote right after the ^ character, but it is not working.

My desired output is this: splitted = [hello, my, name, is, joe, ., what's, your's]

Upvotes: 1

Views: 2121

Answers (3)

The fourth bird

Reputation: 163447

One option is to use lookarounds to split at the desired positions, and a capture group for the delimiters you want to keep in the result.

After the split, you can remove the empty entries from the resulting list.

\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])

The pattern matches

  • \s+ Match 1 or more whitespace chars
  • | Or
  • (?<=\s)' Match ' preceded by a whitespace char
  • | Or
  • '(?=\s) Match ' when followed by a whitespace char
  • | Or
  • (?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character

See a regex demo and a Python demo.

Example

import re

pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)

Output

['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
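To see why the empty entries have to be removed, here is a small sketch that prints the raw result of re.split before filtering. The names raw and cleaned are only for illustration and are not part of the answer above.

```python
import re

pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""

# Because the pattern contains a capture group, re.split also returns the
# group's value for every match. Alternatives that do not use the group
# contribute None, and adjacent delimiters leave empty strings in between.
raw = re.split(pattern, words)
print(raw)

# Keeping only truthy entries drops both the None values and the empty strings.
cleaned = [s for s in raw if s]
print(cleaned)
```

The filter `if s` is what turns the noisy raw list into the output shown above; the captured punctuation survives because it is a non-empty string.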

Upvotes: 1

Joseph Fakelastname

Reputation: 895

I love regex golf!

import re

words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)

The part in the parentheses is a non-capturing group that matches either an apostrophe surrounded by word characters or a single word character.

EDIT:

This is more flexible:

re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
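As a rough illustration of where the two patterns differ, the sketch below runs both on the question's string and on a made-up input, "rock'n'roll", which is my own example and not from the question. The names first, second, and tricky are hypothetical.

```python
import re

first = r"\b(?:\w'\w|\w)+\b"
second = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

words = """hello my name is 'joe.' what's your's"""

# On the question's string both patterns find the same tokens.
print(re.findall(first, words))
print(re.findall(second, words))

# The first pattern consumes the word characters on both sides of an
# apostrophe, so a second apostrophe right after a single letter cannot
# be matched. The lookaround version only checks the neighbours.
tricky = "rock'n'roll"
print(re.findall(first, tricky))   # splits the word in two
print(re.findall(second, tricky))  # keeps it as one token
```

Note that neither findall pattern returns the "." as a separate item, so this approach only covers the words, not the delimiters.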

It's getting a bit unreadable at this point, though. In practice you should probably use Woodford's answer.

Upvotes: 0

Woodford

Reputation: 4449

It might be simpler to split first without worrying about the single quotes, and then clean up the resulting list:

>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower())  # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [w for w in (word.strip("'") for word in split_words) if w]  # strip quotes, drop empty strings
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]

Upvotes: 2
