gitPyGirl

Reputation: 29

Python Regular Expression to search for and print 'apostrophes' (') in output

Is there a way to improve this regular expression so that it finds all words that end with t, including don't? I also want to print the whole word, not just the final t.

r"\b\w*\Wt\b|\b\w*t\b"

I had to write out two separate cases, one for words ending with t and one for words ending with 't. Or is this the best it can be?

Upvotes: 2

Views: 60

Answers (2)

Ryszard Czech

Reputation: 18621

Do not rely on generic patterns if all you want is to allow an apostrophe. \W also matches spaces, and \S matches any non-whitespace character, so both accept more than an apostrophe.

Use

r"\b\w+'?t\b"

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  '?                       '\'' (optional (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  t                        't'
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
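
To illustrate the point about \W matching spaces, here is a small sketch comparing the pattern from the question with the one above. The sample text and the variable names are made up for this illustration. Note that because of \w+, a standalone "t" is not matched by the second pattern.

```python
import re

# Sample text invented for this comparison; "go t" contains a space before the final t.
text = "mitt cat bat don't foobar go t"

# Pattern from the question: \W accepts any non-word character, including a space.
question_pattern = r"\b\w*\Wt\b|\b\w*t\b"
# Pattern from this answer: only a literal apostrophe may sit before the final t.
answer_pattern = r"\b\w+'?t\b"

question_matches = re.findall(question_pattern, text)
answer_matches = re.findall(answer_pattern, text)

print(question_matches)  # ['mitt', 'cat', 'bat', "don't", 'go t']
print(answer_matches)    # ['mitt', 'cat', 'bat', "don't"]
```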

Upvotes: 0

ggorlen

Reputation: 57115

I'd use \b\S*t\b. It avoids making the engine scan a word, fail to find the non-word character, and then retry with the other branch of your pattern. At the very least, swap the two sides of the alternation, because the common case is that the word won't be a contraction.

>>> import re
>>> s = "mitt cat bat don't foobar"
>>> re.findall(r"\b\S*t\b", s)
['mitt', 'cat', 'bat', "don't"]

It's not clear how you want to treat non-word punctuation, but consider a variant that attempts to handle this:

>>> s = "mitt cat bat. don't foobar tee t e.t."
>>> re.findall(r"\b\S*t\b", s)
['mitt', 'cat', 'bat', "don't", 't', 'e.t']
>>> re.findall(r"\b[^.,!?\s]*t\b", s)
['mitt', 'cat', 'bat', "don't", 't', 't']

Clearly, abbreviations and edge cases may need attention if that's part of your specification.
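
If the specification turns out to be "a word may contain only word characters and apostrophes", one possible variant is to whitelist those characters instead of blacklisting punctuation. This is only a sketch under that assumption; the helper name words_ending_in_t is hypothetical and not part of the question.

```python
import re

# Assumption: a "word" consists only of \w characters and apostrophes.
WORD_ENDING_IN_T = re.compile(r"\b[\w']*t\b")

def words_ending_in_t(text):
    """Return every run of word characters/apostrophes that ends in 't'."""
    return WORD_ENDING_IN_T.findall(text)

s = "mitt cat bat. don't foobar tee t e.t."
result = words_ending_in_t(s)
print(result)  # ['mitt', 'cat', 'bat', "don't", 't', 't']
```

On this input it gives the same result as the [^.,!?\s] version, but it also rejects any other punctuation that the blacklist does not mention.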

Upvotes: 2
