Reputation: 51
I tried to get a regex to work but couldn't (probably because I'm fairly new to regex).
Here's what I want to do:
Consider this text: One word, duel. Limes said bye.
Wanted matches: word, duel, Limes, said
As mentioned in the title, I want to match consecutive words where one ends with (for example) "t" and the next one starts with "t" as well, case-insensitively.
The closest I got is this expression: [^a-z][a-z]*([a-z])[^a-z]+\1[a-z]*([a-z])[^a-z]+\2[a-z]*[^a-z]
Upvotes: 3
Views: 493
Reputation: 626870
You may use
(?i)\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b
See the regex demo. The results are in Group "w" capture collection.
Details
\b - a word boundary
(?<w>\p{L}+) - Group "w" (word): 1 or more BMP Unicode letters
(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+ - 1 or more repetitions of
  \P{L}+ - 1 or more chars other than BMP Unicode letters
  (?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*) - Group "w":
    (\p{L}) - a letter captured into Group 1
    (?<=\1\P{L}+\1) - immediately to the left of the current position, there must be the letter captured in Group 1 (the last letter of the previous word), 1+ chars other than letters, and the letter in Group 1 again (the one just matched)
    \p{L}* - 0 or more letters
\b - a word boundary.
C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "One word, duel. Limes said bye.";
var pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
// Group "w" is captured once per word in the chain, so read its capture collection.
var result = Regex.Match(text, pattern, RegexOptions.IgnoreCase).Groups["w"].Captures
    .Cast<Capture>()
    .Select(x => x.Value);
Console.WriteLine(string.Join(", ", result)); // => word, duel, Limes, said
A C# demo version without using LINQ:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string text = "One word, duel. Limes said bye.";
string pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
Match result = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
List<string> output = new List<string>();
if (result.Success)
{
    // Each capture of Group "w" is one word in the chain.
    foreach (Capture c in result.Groups["w"].Captures)
        output.Add(c.Value);
}
Console.WriteLine(string.Join(", ", output));
Upvotes: 3
Reputation: 163362
If every word consists of at least 2 characters a-z, you might use 2 capturing groups and an alternation of two lookarounds: a positive lookahead checks whether the next word starts with the last char of the current word, and a positive lookbehind checks whether the previous word ended with the first char of the current word.
With case insensitive match enabled:
\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))
\b - Word boundary
([a-z]) - Capture group 1, match a-z
[a-z]* - Match 0+ times a-z in between
([a-z]) - Capture group 2, match a-z
\b - Word boundary
(?: - Non capturing group
  (?= - Positive lookahead, assert what is on the right is
    [,.]? \2 - an optional . or , followed by a space and what is captured in group 2
  ) - Close lookahead
  | - Or
  (?<= - Positive lookbehind, assert what is on the left is
    \1 \1[a-z]+ - what is captured in group 1, a space, what is captured in group 1 again, and 1+ times a char a-z
  ) - Close lookbehind
) - Close non capturing group
Note that matching [a-zA-Z] is a small range for a word. You might use \w or \p{L} instead.
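To try this pattern in C#, something along these lines should work. It is only a sketch: the class name ChainedWords and the helper Find are made up for illustration, and it relies on .NET supporting variable-length lookbehind and applying RegexOptions.IgnoreCase to the \1 and \2 backreferences as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChainedWords
{
    // The pattern from this answer, unchanged.
    private const string Pattern =
        @"\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))";

    // Hypothetical helper: returns every word that the pattern
    // considers part of a chain, in order of appearance.
    public static string[] Find(string text)
    {
        return Regex.Matches(text, Pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string[] words = Find("One word, duel. Limes said bye.");
        Console.WriteLine(string.Join(", ", words)); // => word, duel, Limes, said
    }
}
```

Unlike the single-match approach with a capture collection, here each word is its own match, so Regex.Matches is used. Note that the lookbehind branch only allows a single space between the two words, so the last word of a chain is missed when punctuation precedes it (for example "duel" in "word, duel bye").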
Upvotes: 1