Reputation: 123

Regex: capturing punctuation and contractions

I am new to regexes, and I am trying to write a function that breaks a string down into its words, contractions and punctuation.

For example, the string "I'm feeling sad today."

should return the list ["I", "'m", "feeling", "sad", "today", "."].

So far I can only match the words with ([a-zA-Z])\w* and I am not sure how to extend this so that it also captures the punctuation.

Upvotes: 1

Answers (2)

ekhumoro

Reputation: 120598

You need to search for runs of characters that consist either only of word characters, or only of characters that are neither word characters nor whitespace:

>>> import re
>>> s = "I'm feeling sad today."
>>> rgx = re.compile(r'(\w+|[^\w\s]+)')
>>> rgx.findall(s)
['I', "'", 'm', 'feeling', 'sad', 'today', '.']
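
To make the two alternatives easier to read, the same pattern can be written with re.VERBOSE so that each branch carries its own comment. This is only an illustrative sketch; the name TOKEN_RE is made up here and does not come from the answer.

```python
import re

# TOKEN_RE is a hypothetical name used only for this illustration.
TOKEN_RE = re.compile(r"""
    \w+          # a run of word characters (letters, digits, underscore)
  |              # or
    [^\w\s]+     # a run of characters that are neither word characters nor whitespace
""", re.VERBOSE)

tokens = TOKEN_RE.findall("I'm feeling sad today.")
print(tokens)  # ['I', "'", 'm', 'feeling', 'sad', 'today', '.']
```

Because both branches use +, consecutive punctuation such as "?!" comes back as a single token rather than as two.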

EDIT:

To capture contractions, the regexp needs to be more sophisticated. It must use a look-behind assertion to check that the apostrophe is preceded by a word character (otherwise it will wrongly match quoted words). Here's a basic solution:

>>> s = "I'm feeling 'sad' today."
>>> rgx = re.compile(r"((?<=\w)'\w+|\w+|[^\w\s]+)")
>>> rgx.findall(s)
['I', "'m", 'feeling', "'", 'sad', "'", 'today', '.']

There are some edge cases that this can't deal with, though. For instance, there are some transliterated foreign words (e.g. Qur'an) that contain embedded apostrophes. And then of course there are names like O'Connor and possessives such as O'Connor's, as well as non-standard contractions like His 'n' Hers.
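
Since the question asks for a function, here is a minimal sketch that wraps the look-behind pattern and runs it on the edge cases just mentioned, so the splits are visible. The helper name tokenize and the constant _TOKEN_RE are hypothetical, and the outer capture group is dropped because findall returns whole matches when the pattern has no groups.

```python
import re

# Same alternatives as the look-behind pattern in the answer, without the
# outer group. Both names below are hypothetical, chosen for this sketch.
_TOKEN_RE = re.compile(r"(?<=\w)'\w+|\w+|[^\w\s]+")

def tokenize(text):
    """Split text into words, apostrophe suffixes and punctuation runs."""
    return _TOKEN_RE.findall(text)

print(tokenize("I'm feeling sad today."))  # ['I', "'m", 'feeling', 'sad', 'today', '.']
print(tokenize("Qur'an"))                  # ['Qur', "'an"]
print(tokenize("O'Connor's"))              # ['O', "'Connor", "'s"]
print(tokenize("His 'n' Hers"))            # ['His', "'", 'n', "'", 'Hers']
```

If names and transliterations need to stay whole, one option is to look each candidate up in a list of known contraction suffixes ('m, 's, 're and so on) before splitting, but that goes beyond what a single regex does comfortably.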

Upvotes: 0

LetzerWille

Reputation: 5658

import re

st = "I'm feeling sad today."

li = re.findall(r'\w+|[;.,!?:]|\'\w+', st)

print(li)

# Output: ['I', "'m", 'feeling', 'sad', 'today', '.']
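
Two things are worth checking before relying on this pattern, shown in the hedged sketch below. The variable names are made up for the illustration, and the pattern is the one from this answer with the apostrophe written unescaped inside a double-quoted raw string, which matches the same text.

```python
import re

# Equivalent to r'\w+|[;.,!?:]|\'\w+' from the answer above.
pattern = r"\w+|[;.,!?:]|'\w+"

# A quoted word: the opening quote is glued to the word, the closing quote is lost.
quoted = re.findall(pattern, "I'm feeling 'sad' today.")
print(quoted)   # ['I', "'m", 'feeling', "'sad", 'today', '.']

# Punctuation outside the character class is silently skipped.
dropped = re.findall(pattern, "Wait (really)?")
print(dropped)  # ['Wait', 'really', '?']
```

Whether that matters depends on the input: for plain sentences with only the listed punctuation marks, the pattern gives the list the question asks for.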

Upvotes: 2
