Reputation: 135
I'm looking for a regular expression that returns only the words in Title Case (where only the first letter is capitalized) from a given sentence or paragraph.
If the paragraph is:
France’s last serious attempt at ambitious economic reform, an overhaul of pensions and social security, was in the mid-1990s under President Jacques Chirac.
I'd like it to match France, President, Jacques and Chirac.
(I'm writing in Python 3)
Upvotes: 1
Views: 914
Reputation: 91415
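To handle letters in any language, use Unicode properties. Note that Python's built-in re module does not support \p{...}; this needs the third-party regex module (import regex):
regex.findall(r"\b\p{Lu}\p{Ll}+", inputLine)
where \p{Lu} stands for any uppercase letter in any language and \p{Ll} stands for any lowercase letter in any language.
Upvotes: 0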
Reputation: 425033
Use a word boundary, a capital letter, then as many lowercase letters as follow:
\b[A-Z][a-z]+
Like this:
titleWords = re.findall(r"\b[A-Z][a-z]+", line)
Note that + (at least 1) is preferable to * (0 or more) so you don't match single-capital-letter words, like "I" and "A".
The word boundary isn't strictly necessary, but it prevents matching the capitalized tail of camelCase words like "mySpace" (where "Space" would otherwise match). That shouldn't happen in regular text anyway, so you could probably remove \b without ill effect.
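To make the points above concrete, here is a minimal sketch that runs the pattern on the paragraph from the question and then shows what changes with * or without \b. The variable names (TITLE_WORD, paragraph, and so on) are just illustrative choices, not part of the original answer.

```python
import re

# Pattern from the answer: word boundary, one capital, one or more lowercase letters.
TITLE_WORD = re.compile(r"\b[A-Z][a-z]+")

paragraph = (
    "France’s last serious attempt at ambitious economic reform, "
    "an overhaul of pensions and social security, was in the "
    "mid-1990s under President Jacques Chirac."
)

title_words = TITLE_WORD.findall(paragraph)
print(title_words)  # ['France', 'President', 'Jacques', 'Chirac']

# With * instead of +, single capitals such as "I" and "A" also match.
with_star = re.findall(r"\b[A-Z][a-z]*", "I saw A Tale")
print(with_star)  # ['I', 'A', 'Tale']

# Without \b, the capitalized tail of a camelCase word matches.
no_boundary = re.findall(r"[A-Z][a-z]+", "mySpace")
with_boundary = re.findall(r"\b[A-Z][a-z]+", "mySpace")
print(no_boundary)    # ['Space']
print(with_boundary)  # []
```

The pattern only covers ASCII letters, so a word like "Élysée" would not be matched as a whole.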
Upvotes: 1
Reputation: 6783
Depending on the regex flavour, the results may differ.
For PCRE, I suggest:
/\b[A-Z][a-z]*\b/
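As a rough illustration of what the trailing \b changes, here is a small sketch. It uses Python's re module rather than PCRE, on the assumption that the two behave the same for a pattern this simple; the sample sentence and variable names are made up for the example.

```python
import re

sample = "I met McDonald in Paris"

# Pattern as suggested above: * allows single capitals, trailing \b
# rejects words that continue with another capital (e.g. "McDonald").
both_boundaries = re.findall(r"\b[A-Z][a-z]*\b", sample)
print(both_boundaries)  # ['I', 'Paris']

# Without the trailing \b, the "Mc" prefix of "McDonald" is matched.
leading_only = re.findall(r"\b[A-Z][a-z]+", sample)
print(leading_only)  # ['Mc', 'Paris']

# Combining + with both boundaries skips "I" and "McDonald" alike.
strict = re.findall(r"\b[A-Z][a-z]+\b", sample)
print(strict)  # ['Paris']
```

So if single-letter words like "I" should be excluded, swapping * for + while keeping both boundaries is one option.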
Upvotes: 0