Mad Dog Tannen

Reputation: 7244

Find words with capital first letter, that are grouped together

I'm trying to find capitalized words within a string, whether they stand alone or are grouped together.

For example:

This is a String That is my example, Here Is More text as example.

I want to extract them so that my result is the following:

This
String That
Here Is More

The regex I have so far is this

(\b[A-Z][a-z]*\s\b)

This finds capitalized words, but it captures each one separately, including the trailing space. How can I make the regex accept 1 to 3 capitalized words in a row?

Upvotes: 2

Views: 810

Answers (3)

Wiktor Stribiżew

Reputation: 626845

A fully Unicode-aware solution is

\b(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}\b

It will only match 1-3 capitalized words in a row, without leading or trailing whitespace.


Here is the explanation:

  • \b - word boundary (the match must start at the beginning of the string or right after a non-word character)
  • (?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)* - a word that starts with an uppercase letter (with optional combining marks), followed by zero or more Unicode letters (precomposed, or a base letter plus combining marks)
  • (?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2} - 0 to 2 occurrences of
    • \s+ - 1 or more whitespace characters, followed by...
    • (?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)* - another capitalized word made of Unicode letters (potentially with diacritics), as described above.

\p{Lu} matches an uppercase Unicode letter, and \p{M} matches a combining mark (a diacritic encoded as a separate code point). So, to match a capitalized letter together with its marks, use the atomic group (?>\p{Lu}\p{M}*). \p{L} matches any Unicode letter. A whole word is therefore the concatenation of the two subpatterns: (?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*.

C# code:

using System.Linq;                      // Cast, Select, ToList
using System.Text.RegularExpressions;   // Regex, Match

var line = "This is a String That is my example, Here Is More Text as example.";
var pattern = @"\b(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}\b";
var result = Regex.Matches(line, pattern).Cast<Match>().Select(x => x.Value).ToList();

Result: This, String That, Here Is More, Text

Upvotes: 4

Tushar

Reputation: 87203

Use + on the group to match one or more occurrences.

(\b[A-Z][a-z]*\s\b)+


Use {1,3} to match groups of one, two, or three words.

(\b[A-Z][a-z]*\s\b){1,3}
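
As a rough sketch of how the {1,3} pattern above might be used from C#: each repetition consumes the whitespace after the word, so every match ends with a whitespace character, and the capture group only keeps the last repetition. The class and method names (CapitalizedRuns, Find) are made up for this illustration; the sketch reads the whole match and trims it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CapitalizedRuns
{
    // Each repetition is one capitalized ASCII word plus the single
    // whitespace character after it; \s\b also requires that a word
    // character follows that whitespace.
    private static readonly Regex Runs =
        new Regex(@"(\b[A-Z][a-z]*\s\b){1,3}");

    public static List<string> Find(string input)
    {
        return Runs.Matches(input)
                   .Cast<Match>()
                   // Use the whole match: the capture group holds only
                   // the last repeated word. TrimEnd drops the trailing
                   // whitespace that the pattern consumed.
                   .Select(m => m.Value.TrimEnd())
                   .ToList();
    }
}
```

Calling CapitalizedRuns.Find on the question's sample string should give This, String That, Here Is More. One caveat: because of the trailing \s\b, a capitalized word at the very end of the input or directly before punctuation is not picked up as the last word of a group, so "Hello World." yields only Hello.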


Upvotes: 3

Avinash Raj

Reputation: 174706

Define a second pattern for the following words and repeat it zero or more times.

@"\b[A-Z][a-z]*(?:\s[A-Z][a-z]*)*\b"
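
The pattern above puts no upper limit on the group size. If the 1-to-3-word limit from the question matters, one possible variation (an adaptation, not what this answer wrote) is to swap the trailing * for {0,2}. A minimal sketch, with the hypothetical names CapitalizedGroups and Find:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CapitalizedGroups
{
    // First capitalized ASCII word, then at most two more, each preceded
    // by exactly one whitespace character. The whitespace sits in front
    // of the following word, so the match has no trailing space.
    private static readonly Regex Groups =
        new Regex(@"\b[A-Z][a-z]*(?:\s[A-Z][a-z]*){0,2}\b");

    public static List<string> Find(string input)
    {
        return Groups.Matches(input)
                     .Cast<Match>()
                     .Select(m => m.Value)
                     .ToList();
    }
}
```

On the question's sample string this should return This, String That, Here Is More, and "Hello World." yields Hello World, since the closing \b is satisfied before the period. Like the original, it is ASCII-only and allows a single whitespace character between words.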


Upvotes: 3
