Reputation: 123
I am new to regexes, and I am trying to write a function to break down a string into its contractions and punctuation.
For example, the string "I'm feeling sad today." should return the list ["I", "'m", "feeling", "sad", "today", "."].
So far I can only match the letters with ([a-zA-Z])\w* and I am not sure how to extend this to include the punctuation.
Upvotes: 1
Views: 2330
Reputation: 120598
You need to search for runs of characters which are either only word-characters, or neither word-characters nor whitespace:
>>> import re
>>> s = "I'm feeling sad today."
>>> rgx = re.compile(r'(\w+|[^\w\s]+)')
>>> rgx.findall(s)
['I', "'", 'm', 'feeling', 'sad', 'today', '.']
EDIT:
To capture contractions, the regexp needs to be more sophisticated. It must use a look-behind assertion to check that the apostrophe is preceded by a word-character (otherwise it will wrongly match quoted words). Here's a basic solution:
>>> s = "I'm feeling 'sad' today."
>>> rgx = re.compile(r"((?<=\w)'\w+|\w+|[^\w\s]+)")
>>> rgx.findall(s)
['I', "'m", 'feeling', "'", 'sad', "'", 'today', '.']
There are some edge cases that this can't deal with, though. For instance, there are some transliterated foreign words (e.g. Qur'an) that contain embedded apostrophes. And then of course there are names like O'Connor and possessives such as O'Connor's, as well as non-standard contractions like His 'n' Hers.
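To make these edge cases concrete, here is a small sketch that runs the look-behind pattern from above over each of them. The tokenize helper name is only an illustration and not part of the solution itself; the pattern is unchanged.

```python
import re

# Same pattern as above: contraction suffix, word, or run of punctuation.
rgx = re.compile(r"((?<=\w)'\w+|\w+|[^\w\s]+)")

def tokenize(text):
    # Hypothetical helper: return every token that rgx finds in text.
    return rgx.findall(text)

print(tokenize("Qur'an"))        # ['Qur', "'an"]
print(tokenize("O'Connor's"))    # ['O', "'Connor", "'s"]
print(tokenize("His 'n' Hers"))  # ['His', "'", 'n', "'", 'Hers']
```

In each case the apostrophe is treated either as the start of a contraction suffix or as stand-alone punctuation, so the word is split where a reader would not split it. Handling these well probably needs a list of known exceptions rather than a more elaborate pattern.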
Upvotes: 0
Reputation: 5658
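import re
st = "I'm feeling sad today."
# Match a word, one of the listed punctuation marks, or an apostrophe followed by a word
li = re.findall(r"\w+|[;.,!?:]|'\w+", st)
print(li)
# ['I', "'m", 'feeling', 'sad', 'today', '.']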
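One caveat with this pattern: it accepts an apostrophe in front of any word, without checking what precedes it, and a bare apostrophe is not in the punctuation class. So an opening quote sticks to the following word and a closing quote is silently dropped. A quick check, assuming the quoted test string from the other answer:

```python
import re

# Same three alternatives as above: word, listed punctuation, apostrophe + word.
pattern = r"\w+|[;.,!?:]|'\w+"
tokens = re.findall(pattern, "I'm feeling 'sad' today.")
print(tokens)  # ['I', "'m", 'feeling', "'sad", 'today', '.']
```

If quoted words can appear in the input, the look-behind variant is likely the safer choice.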
Upvotes: 2