NeptuneGamer

Reputation: 123

Regex: capturing punctuation and contractions

I am new to regexes, and I am trying to write a function that breaks a string down into its words, contractions and punctuation.

For example, the string "I'm feeling sad today." should return the list ["I","'m","feeling","sad","today","."].

So far I can only match the words with ([a-zA-Z])\w* and I am not sure how to extend this to include the punctuation.

Upvotes: 1

Views: 2330

Answers (2)

ekhumoro

Reputation: 120598

You need to search for runs of characters which are either only word-characters, or neither word-characters nor whitespace:

>>> import re
>>> s = "I'm feeling sad today."
>>> rgx = re.compile(r'(\w+|[^\w\s]+)')
>>> rgx.findall(s)
['I', "'", 'm', 'feeling', 'sad', 'today', '.']

EDIT:

To capture contractions, the regexp needs to be more sophisticated. It must use a look-behind assertion to check that the apostrophe is preceded by a word-character (otherwise it will wrongly match the opening quote of a quoted word). Here's a basic solution:

>>> s = "I'm feeling 'sad' today."
>>> rgx = re.compile(r"((?<=\w)'\w+|\w+|[^\w\s]+)")
>>> rgx.findall(s)
['I', "'m", 'feeling', "'", 'sad', "'", 'today', '.']
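
Since the question asks for a function, the pattern above could be wrapped along these lines. This is only a sketch: the names TOKEN_RE and tokenize are made up for illustration, and the outer capturing group is dropped because findall returns the same list without it.

```python
import re

# Same alternation as above: contraction suffix, word, or punctuation run.
# TOKEN_RE and tokenize are illustrative names, not part of the answer.
TOKEN_RE = re.compile(r"(?<=\w)'\w+|\w+|[^\w\s]+")

def tokenize(text):
    """Split text into words, contraction suffixes and punctuation runs."""
    return TOKEN_RE.findall(text)

print(tokenize("I'm feeling sad today."))
# ['I', "'m", 'feeling', 'sad', 'today', '.']
```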

There are some edge cases that this can't deal with, though. For instance, there are some transliterated foreign words (e.g. Qur'an) that contain embedded apostrophes. And then of course there are names like O'Connor and possessives such as O'Connor's, as well as non-standard contractions like His 'n' Hers.
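
To make these edge cases concrete, here is roughly what the look-behind pattern does with them. The sample strings come from the paragraph above; the variable names are illustrative only.

```python
import re

rgx = re.compile(r"(?<=\w)'\w+|\w+|[^\w\s]+")

samples = ["Qur'an", "O'Connor's", "His 'n' Hers"]
results = {s: rgx.findall(s) for s in samples}

# Embedded apostrophes are treated as if they started a contraction.
print(results["Qur'an"])        # ['Qur', "'an"]
print(results["O'Connor's"])    # ['O', "'Connor", "'s"]

# A non-standard contraction is split like a quoted word.
print(results["His 'n' Hers"])  # ['His', "'", 'n', "'", 'Hers']
```

Handling these properly would probably need a list of known exceptions rather than a cleverer regexp.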

Upvotes: 0

LetzerWille

Reputation: 5658

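import re

st = "I'm feeling sad today."

# Match a word, a single punctuation mark, or an apostrophe followed by word characters
li = re.findall(r"\w+|[;.,!?:]|'\w+", st)

print(li)
# ['I', "'m", 'feeling', 'sad', 'today', '.']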
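
One thing to watch for with this pattern, sketched below: since '\w+ has no look-behind, an opening quote gets glued to the quoted word, and a closing quote (which is not in the punctuation class) is silently dropped. The sample sentence is borrowed from the other answer, and the variable names are illustrative.

```python
import re

pattern = r"\w+|[;.,!?:]|'\w+"

# A quoted word: the opening quote sticks to 'sad', the closing quote vanishes.
quoted = re.findall(pattern, "I'm feeling 'sad' today.")
print(quoted)
# ['I', "'m", 'feeling', "'sad", 'today', '.']
```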

Upvotes: 2
