Daniel Brown

Reputation: 3062

C# regex to split string and include matched expression in split

I have the following string:

Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck, 
t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto
shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short 
ribs burgdoggen. **Tail cupim.**

I want to obtain:

List<string>(){
    "Bacon ipsum dolor amet ",
    "**kevin kielbasa**",
    " pork chop picanha chuck, t-bone ",
    "**brisket corned beef fatback hamburger cow**",
    " sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
    "**tongue**",
    " fatback drumstick frankfurter short ribs burgdoggen. ",
    "**Tail cupim.**"
}

Approaches:

  1. Entirely in Regex:

First Pass

Regex.Split(str, @"\*\*.*?\*\*");

"Bacon ipsum dolor amet ",
" pork chop picanha chuck, t-bone ",
" sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
" fatback drumstick frankfurter short ribs burgdoggen. "

Split removes all of the matching items. It treats each one as a delimiter between the items it thinks we want. D'oh!

Second Pass

Regex.Matches(str, @"\*\*.*?\*\*").Cast<Match>().Select(m => m.Value).ToList();

"**kevin kielbasa**",
"**brisket corned beef fatback hamburger cow**",
"**tongue**",
"**Tail cupim.**"

Well, that makes sense. Regex.Matches() returns all of the items that match the regular expression, so we've lost all of the content between.

  2. With a dash of LINQ:

Okay, let's see if we can get all of our text in a list together:

Regex.Split(str, @"\*\*");

"Bacon ipsum dolor amet ",
"kevin kielbasa",
" pork chop picanha chuck, t-bone ",
"brisket corned beef fatback hamburger cow",
" sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
"tongue",
" fatback drumstick frankfurter short ribs burgdoggen. ",
"Tail cupim."

Oddly, this simple regex gets us the closest, but we no longer know which items in the list were surrounded by **s. Because the ** alternates every list item, all we need to know is if the first (or second) item in the list is surrounded by **.

bool firstIsMatch = str.StartsWith("**", StringComparison.Ordinal);

And then we can use that bool to determine if we're adding "**" to the beginning and end of every even or odd item in the list.
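For illustration, here is a minimal sketch of that re-wrapping step, assuming the ** markers are always balanced. The class and method names (MarkerSplit, SplitKeepingMarkers) are hypothetical. Regex.Split keeps an empty string when a match sits at the very start or end of the input, so under that assumption the delimited segments always land at odd indices, and the sketch relies on index parity instead of a separate flag.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MarkerSplit
{
    // Hypothetical helper: split on "**", then put the markers back
    // around every segment that sat between a pair of them.
    public static List<string> SplitKeepingMarkers(string str)
    {
        string[] parts = Regex.Split(str, @"\*\*");

        return parts
            // Assumption: markers are balanced, so odd indices were delimited.
            .Select((part, index) => index % 2 == 1 ? "**" + part + "**" : part)
            // Drop the empty entries Split emits at the start or end of the input.
            .Where(part => part.Length > 0)
            .ToList();
    }

    static void Main()
    {
        foreach (var part in SplitKeepingMarkers("plain **bold** tail **end**"))
            Console.WriteLine("\"{0}\"", part);
    }
}
```

With unbalanced markers the parity assumption breaks, so this only fits input where every opening ** has a closing one.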

Questions:

Upvotes: 2

Views: 4630

Answers (3)

Wiktor Stribiżew

Reputation: 626691

All you need is to wrap your regex in a capturing group. Once the regex finds a match to split on, the matched text will also be pushed into the resulting array. See the Regex.Split reference:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.

The empty elements can be easily filtered out later with LINQ:

var str  = "Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck, t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short ribs burgdoggen. **Tail cupim.**";
var res = Regex.Split(str, @"(\*{2}.*?\*{2})", RegexOptions.Singleline) // Split and keep the captures
        .Where(s => !string.IsNullOrWhiteSpace(s)); // Remove blank elements
Console.WriteLine("\"{0}\"", string.Join("\"\n\"", res));

See C# demo.

A small note on the performance of the pattern: if the text is very large, you might experience a slowdown due to the lazy dot matching pattern. It is a good idea to unroll it as @"\*{2}[^*]*(?:\*(?!\*)[^*]*)*\*{2}", especially if there are only a few "wild", standalone asterisks (the delimiter character) in the text.
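As a sketch of how the unrolled pattern could be dropped into the same Regex.Split call: the pattern is wrapped in a capturing group so the matches are kept, and RegexOptions.Singleline is no longer needed because the pattern contains no dot. The class and method names (UnrolledSplit, Split) are made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class UnrolledSplit
{
    // Unrolled form of \*{2}.*?\*{2}, inside a capturing group
    // so that Regex.Split returns the matched text as well.
    const string Pattern = @"(\*{2}[^*]*(?:\*(?!\*)[^*]*)*\*{2})";

    public static string[] Split(string input) =>
        Regex.Split(input, Pattern)
             .Where(s => !string.IsNullOrWhiteSpace(s)) // Remove blank elements
             .ToArray();

    static void Main()
    {
        // A lone asterisk inside the markers stays part of the match.
        foreach (var part in Split("a **b * c** d"))
            Console.WriteLine("\"{0}\"", part);
        // "a "
        // "**b * c**"
        // " d"
    }
}
```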

Upvotes: 2

bobble bubble

Reputation: 18490

How about using Regex.Matches with an alternation (pipe) in your regex, e.g.:

(?s)\*\*.*?\*\*|.+?(?=\*\*|$)

See demo at regex storm

The lookahead in the second alternative makes the match stop right before the next ** or at the end of the string ($).
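In C# this could be wired up roughly as follows. It is only a sketch: the class and method names (MatchTokenizer, Tokenize) are invented for the example, and the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchTokenizer
{
    // First alternative: a **...** block. Second alternative: any other text,
    // consumed lazily up to the next ** or the end of the input.
    const string Pattern = @"(?s)\*\*.*?\*\*|.+?(?=\*\*|$)";

    public static List<string> Tokenize(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    static void Main()
    {
        foreach (var token in Tokenize("x **y** z"))
            Console.WriteLine("\"{0}\"", token);
        // "x "
        // "**y**"
        // " z"
    }
}
```

Because every match is non-empty, there are no blank elements to filter out afterwards.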

Upvotes: 2

Wagner DosAnjos

Reputation: 6374

Please try the following:

var s = "Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck, " +
"t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto " +
"shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short " +
"ribs burgdoggen. **Tail cupim.**";

var split = Regex.Split(s, @"(?=\*\*\S)|(?<=\S\*\*)");

foreach (var part in split)
{
    Console.WriteLine(part);
}

// == OUTPUT ==
//
// Bacon ipsum dolor amet 
// **kevin kielbasa**
//  pork chop picanha chuck, t-bone 
// **brisket corned beef fatback hamburger cow**
//  sirloin shank prosciutto shankle. T-bone pancetta ribeye 
// **tongue**
//  fatback drumstick frankfurter short ribs burgdoggen. 
// **Tail cupim.**

Upvotes: 1
