Caveatrob

Reputation: 13287

Regexp to pull capitalized words not at the beginning of sentence and two adjacent words

I want to pull out capitalized words that don't start a sentence along with the previous and following word.

I'm using:

(\w*)\b([A-Z][a-z]\w*)\b(\w*)

replace with:

$1 -- $2 -- $3

Edit: It's only returning the $2. Will try suggestions.

As for natural-language processing: I don't need it here. I just want to see where capitals show up inside a sentence so I can decide whether they are proper nouns or not.

Upvotes: 2

Views: 1788

Answers (2)

Brendan

Reputation: 1873

How about this?

([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)

This doesn't take into account anything non-alphabetic though. It also assumes that all words are separated by a single whitespace character. You will need to modify it if you want more complex support.
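To make those two limitations concrete, here is a small sketch in Python's re module. The sample sentences are made up for illustration, and Python is only an assumption; the question doesn't say which engine is in use.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r"([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)")

# Words separated by exactly one whitespace character: all three groups fill in.
single_space = pattern.search("we met Alice yesterday").groups()
print(single_space)  # ('met', 'Alice', 'yesterday')

# Two spaces before the capitalized word: the single \s cannot bridge them.
double_space = pattern.search("we met  Alice yesterday")
print(double_space)  # None

# Punctuation after the capitalized word: \s does not match the comma.
with_comma = pattern.search("we met Alice, yesterday")
print(with_comma)  # None
```

Replacing each \s with \s+ (or \W+) would be one way to loosen this, at the cost of matching across sentence boundaries as well.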

Upvotes: 2

Tim Pietzcker

Reputation: 336488

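Right now your regex returns only $2 because of where the \b anchors sit. \b matches only at a boundary between a word character (\w) and a non-word character, so it can never match between a non-empty \w* and [A-Z], or between the capitalized word and a non-empty trailing \w*. The regex can still match, but only with $1 and $3 empty.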
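A quick way to see this is to run the original pattern and inspect the groups. This is a minimal sketch using Python's re module (an assumption, since the question doesn't name the engine) and a made-up sample sentence:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(\w*)\b([A-Z][a-z]\w*)\b(\w*)")

text = "the Quick brown fox"  # hypothetical sample input
match = pattern.search(text)

# Groups 1 and 3 are forced to be empty: a word boundary cannot sit
# between two word characters, so \w* on either side must match nothing.
groups = match.groups()
print(groups)  # ('', 'Quick', '')
```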

So, you need some other (=non-alphanumeric) characters between your words:

Try

(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)

although (if your regex engine allows using Unicode properties), you might be happier with

(\w*)\W+(\p{Lu}\p{Ll}\w*)\W+(\w*)

As written, only capitalized words of length 2 or more are matched, i.e. "I" (as in "me") will not be matched. I suppose you inserted the [a-z] to avoid matches like "IBM"? Or what was your intention?
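Here is a small sketch of the first pattern in Python's re module, with made-up sample sentences. Two things are assumptions on my part: that Python is a reasonable stand-in for your engine, and that its \1 replacement syntax corresponds to your $1. Note also that Python's built-in re does not support \p{Lu}, so the Unicode variant would need an engine that does (for Python, the third-party regex module).

```python
import re

# The ASCII version of the pattern suggested above.
pattern = re.compile(r"(\w*)\W+([A-Z][a-z]\w*)\W+(\w*)")

# Previous word, capitalized word, following word.
triples = pattern.findall("we met Alice yesterday")
print(triples)  # [('met', 'Alice', 'yesterday')]

# Equivalent of replacing with "$1 -- $2 -- $3" (Python writes \1, \2, \3).
replaced = pattern.sub(r"\1 -- \2 -- \3", "we met Alice yesterday")
print(replaced)  # we met -- Alice -- yesterday

# A one-letter capital such as "I" is skipped: [a-z] needs a second letter.
single_letter = pattern.findall("then I left")
print(single_letter)  # []

# An all-caps word such as "IBM" is skipped for the same reason.
all_caps = pattern.findall("the IBM office")
print(all_caps)  # []
```

One caveat: nothing in the pattern looks at sentence punctuation, so a capitalized word that starts a sentence in the middle of the text will still be matched together with the last word of the previous sentence.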

Upvotes: 2
