Reputation: 7244
I'm trying to find capitalized words within a string, whether they appear singly or grouped together.
For example:
This is a String That is my example, Here Is More text as example.
I want to extract them so that my result is the following:
This
String That
Here Is More
The regex I have so far is this:
(\b[A-Z][a-z]*\s\b)
This finds capitalized words, but it only matches them one at a time, and each match includes the trailing space. How can I control the regex so that it accepts 1 to 3 capitalized words in a row?
Upvotes: 2
Views: 810
Reputation: 626845
A truly Unicode-aware solution is:
\b(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}\b
It will only match 1 to 3 capitalized words in a row, without leading or trailing whitespace.
See regex demo
Here is the explanation:
\b
- a word boundary (the match must start at the beginning of the string or right after a non-word character)
(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*
- a word starting with an uppercase letter (followed by optional diacritics), then followed by any number of Unicode letters (precomposed ones, too)
(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}
- 0 to 2 occurrences of
\s+
- 1 or more whitespace characters, followed by...
(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*
- another capitalized word consisting of Unicode letters (potentially with diacritics).
\p{Lu} matches an uppercase Unicode letter and \p{M} matches a diacritic (combining mark). So, to match a capitalized Unicode letter, use the atomic group (?>\p{Lu}\p{M}*). \p{L} matches any base Unicode letter. A whole word is therefore the combination of the subpatterns (?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*.
using System.Linq;
using System.Text.RegularExpressions;
// Collect every run of 1 to 3 capitalized words as a list of strings.
var line = "This is a String That is my example, Here Is More Text as example.";
var pattern = @"\b(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}\b";
var result = Regex.Matches(line, pattern).Cast<Match>().Select(x => x.Value).ToList();
Result: This, String That, Here Is More, Text
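To illustrate why the \p{M}* parts matter, here is a minimal sketch that runs the pattern on a word written with a combining accent. The class name CombiningMarkDemo, the helper Find, and the simplified WithoutMarks pattern are assumptions made up for this comparison; they are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombiningMarkDemo
{
    // Pattern from the answer: each letter may be followed by combining marks.
    public const string WithMarks =
        @"\b(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*(?:\s+(?>\p{Lu}\p{M}*)(?>\p{L}\p{M}*)*){0,2}\b";

    // Hypothetical simplified variant without \p{M}*, for comparison only.
    public const string WithoutMarks =
        @"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){0,2}\b";

    // Returns the text of every match of the given pattern in the input.
    public static List<string> Find(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        // "École" in decomposed form: 'E' followed by U+0301 (combining acute accent).
        var input = "E\u0301cole Normale";
        Console.WriteLine(string.Join(" | ", Find(input, WithMarks)));
        Console.WriteLine(string.Join(" | ", Find(input, WithoutMarks)));
    }
}
```

With the answer's pattern, the whole phrase should come back as a single match. The simplified variant cannot consume the combining accent at all, so it cannot return the accented word intact.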
Upvotes: 4
Reputation: 87203
Use + on the group to match more than one occurrence:
(\b[A-Z][a-z]*\s\b)+
Use {1,3} to match one, two or three words in a row:
(\b[A-Z][a-z]*\s\b){1,3}
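As a rough sketch of how the {1,3} version behaves in C#, assuming the example sentence from the question: every repetition of the group includes the \s, so each match ends with a whitespace character, which is trimmed here. The class name TrailingSpaceDemo and the helper Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TrailingSpaceDemo
{
    // Finds runs of 1 to 3 capitalized ASCII words and trims the
    // trailing whitespace that the group captures with every word.
    public static List<string> Find(string input) =>
        Regex.Matches(input, @"(\b[A-Z][a-z]*\s\b){1,3}")
             .Cast<Match>()
             .Select(m => m.Value.TrimEnd())
             .ToList();

    public static void Main()
    {
        var line = "This is a String That is my example, Here Is More text as example.";
        // Prints: This | String That | Here Is More
        Console.WriteLine(string.Join(" | ", Find(line)));
    }
}
```

Because the group requires \s followed by a word character, a capitalized word that sits directly before punctuation or at the end of the string is not matched: for "Hello World." this sketch returns only Hello.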
Upvotes: 3
Reputation: 174706
Define a second pattern for the following words and repeat it zero or more times:
@"\b[A-Z][a-z]*(?:\s[A-Z][a-z]*)*\b"
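The pattern above places no upper limit on the number of words in a match. A minimal sketch of one way to cap it at three, assuming the * quantifier is swapped for {0,2} (this variant, the class name BoundedRunsDemo and the helper Find are assumptions, not part of the answer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundedRunsDemo
{
    // Pattern from the answer: one capitalized word plus any number of further ones.
    public const string Unbounded = @"\b[A-Z][a-z]*(?:\s[A-Z][a-z]*)*\b";

    // Assumed variant: {0,2} instead of * caps each match at three words.
    public const string UpToThree = @"\b[A-Z][a-z]*(?:\s[A-Z][a-z]*){0,2}\b";

    // Returns the text of every match of the given pattern in the input.
    public static List<string> Find(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        var input = "One Two Three Four five";
        // Prints: One Two Three Four
        Console.WriteLine(string.Join(" | ", Find(input, Unbounded)));
        // Prints: One Two Three | Four
        Console.WriteLine(string.Join(" | ", Find(input, UpToThree)));
    }
}
```

Since the whitespace sits inside the optional group rather than after every word, neither variant includes trailing whitespace in its matches.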
Upvotes: 3