John Smith

Reputation: 135

Regex Code for getting only words in title case from a paragraph

I'm looking for a regular expression that returns only the words in Title Case (where only the first letter is capitalized) from a given sentence or paragraph.

If the paragraph is:

France’s last serious attempt at ambitious economic reform, an overhaul of pensions and social security, was in the mid-1990s under President Jacques Chirac.

I'd like it to match France, President, Jacques and Chirac.

(I'm writing in Python 3)

Upvotes: 1

Views: 914

Answers (3)

Toto

Reputation: 91415

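To handle letters in any language, use Unicode properties. Python's built-in re module does not support \p{...}, so this needs the third-party regex module (import regex):

regex.findall(r"\b\p{Lu}\p{Ll}+", inputLine)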

where

  • \p{Lu} stands for any uppercase letter in any language
  • \p{Ll} stands for any lowercase letter in any language
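Since the \p{Lu} / \p{Ll} syntax is not understood by Python's built-in re module (it needs the third-party regex package), here is a rough stdlib-only sketch of the same idea. It swaps in a different technique: extract runs of letters with re, then filter with str.isupper() and str.islower(). The function name title_case_words is made up for illustration, and unlike the pattern above it checks the whole word rather than a prefix, so treat it as an approximation rather than an exact equivalent.

```python
import re


def title_case_words(text):
    """Return words whose first letter is uppercase and the rest lowercase.

    Hypothetical helper: a stdlib-only approximation of \\p{Lu}\\p{Ll}+.
    """
    # [^\W\d_]+ = word characters minus digits and underscore, i.e. runs of letters.
    words = re.findall(r"[^\W\d_]+", text)
    # Require at least two letters so single capitals like "I" are skipped.
    return [w for w in words if len(w) > 1 and w[0].isupper() and w[1:].islower()]


print(title_case_words("\u00c9mile Zola was born in Paris"))
# ['Émile', 'Zola', 'Paris']
```

A plain [A-Z][a-z]+ pattern would miss "Émile" here, which is the case the Unicode properties are meant to cover.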

Upvotes: 0

Bohemian

Reputation: 425033

Use a word boundary, a capital letter, then as many lowercase letters as follow:

\b[A-Z][a-z]+

Like this:

titleWords = re.findall(r"\b[A-Z][a-z]+", line)

See live demo.

Note that + (at least 1) is preferable to * (0 or more) so you don't match single-capital-letter words, like "I" and "A".

The word boundary isn't strictly necessary, but it prevents matching inside camelCase words like "mySpace" (where "Space" would otherwise match). That shouldn't happen in regular text anyway, so you could probably remove \b without ill effect.
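As a quick check, here is a small sketch that runs this pattern against the paragraph from the question and shows what \b changes for a camelCase word. The variable names (paragraph, TITLE_WORD, title_words) are just illustrative.

```python
import re

paragraph = (
    "France’s last serious attempt at ambitious economic reform, "
    "an overhaul of pensions and social security, was in the mid-1990s "
    "under President Jacques Chirac."
)

# Word boundary, one capital letter, then one or more lowercase letters.
TITLE_WORD = re.compile(r"\b[A-Z][a-z]+")

title_words = TITLE_WORD.findall(paragraph)
print(title_words)  # ['France', 'President', 'Jacques', 'Chirac']

# Without \b the match may start in the middle of a camelCase word.
print(re.findall(r"[A-Z][a-z]+", "mySpace"))    # ['Space']
print(re.findall(r"\b[A-Z][a-z]+", "mySpace"))  # []
```

Note that "France’s" yields "France" because the apostrophe ends the run of lowercase letters.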

Upvotes: 1

Psi

Reputation: 6783

Results may differ depending on the regex flavour.

For PCRE, I suggest:

/\b[A-Z][a-z]*\b/
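For comparison in Python (which doesn't use the /.../ delimiters), here is a small sketch of how this pattern behaves. The sample sentence is made up. Two things stand out: * lets single capitals such as "I" and "A" through, and the trailing \b rejects words with a capital in the middle, like "McDonald".

```python
import re

sample = "I met A McDonald in Paris"

# * allows zero lowercase letters, so lone capitals match.
with_star = re.findall(r"\b[A-Z][a-z]*\b", sample)
# + requires at least one lowercase letter after the capital.
with_plus = re.findall(r"\b[A-Z][a-z]+\b", sample)

print(with_star)  # ['I', 'A', 'Paris']
print(with_plus)  # ['Paris']
```

If single-letter words should be excluded, as in the question's expected output, the + variant is probably the closer fit.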

Upvotes: 0
