s15199d

Reputation: 7707

.NET Regex Split String into Word Pairs

I have a string "word1 word2 word3 word4 word5"

I would like to split that into an array of overlapping pairs: "word1 word2" | "word2 word3" | "word3 word4" | "word4 word5"

I can do it using a .NET split and a loop, but I'd rather do it with a regex using Regex.Split.

Here's the working split and loop:

Dim keywordPairArr As String() = Regex.Split(Trim(keywords), "[ ]")
For i As Integer = 0 To keywordPairArr.Length - 2
    Dim keyword As String = keywordPairArr(i) & " " & keywordPairArr(i + 1)
    If Not keywordDictionary.ContainsKey(keyword) Then
        ' Escape the pair so it is matched literally; "[" & keyword & "]+" would be a character class.
        keywordDictionary.Add(keyword, Regex.Matches(keywords, Regex.Escape(keyword)).Count)
    End If
Next

Bonus: Groups of N words would be nice. N=3 would output "word1 word2 word3" | "word2 word3 word4" | "word3 word4 word5"

Any help on the regex for splitting the string at every Nth [ ]?

Upvotes: 1

Views: 559

Answers (1)

AVIDeveloper

Reputation: 3476

You can use Regex.Matches() for this task.

Here's a C# example that will output the result:

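using System.Diagnostics;
using System.Text.RegularExpressions;

void PrintWordGroups( string input, string pattern )
{
    MatchCollection mc = Regex.Matches( input.Trim(), pattern );
    foreach ( Match m in mc )
    {
        // Groups[1] is the word group itself, without the trailing whitespace matched by \s*.
        Trace.WriteLine( m.Groups[1].Value );
    }
}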

void PrintGroupsOf2( string input )
{
    PrintWordGroups( input, @"([^\s]+\s+[^\s]+)\s*" );
}

void PrintGroupsOf3( string input )
{
    PrintWordGroups( input, @"(([^\s]+\s+){2}[^\s]+)\s*" );
}

void PrintGroupsOfN( string input, int n )
{
    string pattern = string.Format( @"(([^\s]+\s+){{{0}}}[^\s]+)\s*", n - 1 );
    PrintWordGroups( input, pattern );
}

Assumptions:

  • The words are delimited by whitespace.
  • The number of words in the input must be a multiple of the number of words in a group (e.g. 3, 6, 9, 12, etc. for groups of 3 words); otherwise the leftover words at the end are not matched.
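
Note that these patterns consume each word once, so they return consecutive groups ("word1 word2", "word3 word4") rather than the overlapping pairs shown in the question. One possible way to get overlapping groups is to capture inside a lookahead, so the match itself has zero length and the next attempt can start at the following word. The sketch below is only an illustration under that assumption and is not part of the original answer; the names WordWindows and OverlappingGroups are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WordWindows
{
    // Hypothetical helper: returns every run of n consecutive words,
    // advancing one word at a time (overlapping groups).
    public static List<string> OverlappingGroups(string input, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // (?<!\S)   : only start where a word starts (no non-space char before)
        // (?=(...)) : capture n words without consuming them, so the next
        //             match can begin at the following word
        string pattern = @"(?<!\S)(?=(\S+(?:\s+\S+){" + (n - 1) + @"}))";

        var groups = new List<string>();
        foreach (Match m in Regex.Matches(input.Trim(), pattern))
        {
            // The match is empty; the words are in capture group 1.
            groups.Add(m.Groups[1].Value);
        }
        return groups;
    }
}
```

With the question's input, OverlappingGroups("word1 word2 word3 word4 word5", 2) yields "word1 word2", "word2 word3", "word3 word4" and "word4 word5". This version also drops the requirement that the word count be a multiple of n.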

Patterns Explained:

  1. ([^\s]+\s+[^\s]+)\s* - capture word->whitespace->word, followed by optional whitespace (optional because the last group won't have it due to the Trim() operation in PrintWordGroups()). Note that [^\s] is equivalent to \S.
  2. (([^\s]+\s+){2}[^\s]+)\s* - match word->whitespace twice, then finish with a third word and the optional whitespace.
  3. string.Format( @"(([^\s]+\s+){{{0}}}[^\s]+)\s*", n - 1 )
    This is the generic case for capturing N-1 words + whitespaces and then finishing with the Nth word and the optional whitespace.
    For example, if n=6, the formatted string will be: (([^\s]+\s+){5}[^\s]+)\s*.
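
To see the brace escaping at work, here is a small sketch that builds the pattern and collects the captured groups. The format string is the one from PrintGroupsOfN; the names PatternDemo, BuildPattern and GroupsOfN are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // In a format string, "{{" and "}}" are literal braces, so "{{{0}}}"
    // becomes "{" + (n - 1) + "}", i.e. a regex repetition count.
    public static string BuildPattern(int n)
    {
        return string.Format(@"(([^\s]+\s+){{{0}}}[^\s]+)\s*", n - 1);
    }

    // Returns consecutive, non-overlapping groups of n words.
    public static List<string> GroupsOfN(string input, int n)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input.Trim(), BuildPattern(n)))
        {
            // Group 1 excludes the trailing whitespace matched by \s*.
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Here BuildPattern(3) returns (([^\s]+\s+){2}[^\s]+)\s*, and GroupsOfN("word1 word2 word3 word4 word5 word6", 3) yields "word1 word2 word3" and "word4 word5 word6".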

Upvotes: 2
