Reputation: 3062
I have the following string:
Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck,
t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto
shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short
ribs burgdoggen. **Tail cupim.**
I want to obtain:
List<string>(){
"Bacon ipsum dolor amet ",
"**kevin kielbasa**",
" pork chop picanha chuck, t-bone ",
"**brisket corned beef fatback hamburger cow**",
" sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
"**tongue**",
" fatback drumstick frankfurter short ribs burgdoggen. ",
"**Tail cupim.**"
}
Approaches:
First Pass
Regex.Split(str, @"\*\*.*?\*\*");
"Bacon ipsum dolor amet ",
" pork chop picanha chuck, t-bone ",
" sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
" fatback drumstick frankfurter short ribs burgdoggen. "
Split removes all of the matching items. It treats each one as a delimiter between the items it thinks we want. D'oh!
Second Pass
Regex.Matches(str, @"\*\*.*?\*\*").Cast<Match>().Select(m => m.Value).ToList();
"**kevin kielbasa**",
"**brisket corned beef fatback hamburger cow**",
"**tongue**",
"**Tail cupim.**"
Well, that makes sense. Regex.Matches() returns all of the items that match the regular expression, so we've lost all of the content between.
Okay, let's see if we can get all of our text in a list together:
Regex.Split(str, @"\*\*");
"Bacon ipsum dolor amet ",
"kevin kielbasa",
" pork chop picanha chuck, t-bone ",
"brisket corned beef fatback hamburger cow",
" sirloin shank prosciutto shankle. T-bone pancetta ribeye ",
"tongue",
" fatback drumstick frankfurter short ribs burgdoggen. ",
"Tail cupim."
Oddly, this simple regex gets us the closest, but we no longer know which items in the list were surrounded by **s. Because the ** alternates every list item, all we need to know is if the first (or second) item in the list is surrounded by **.
bool firstIsMatch = "**" == new string(str.Take(2).ToArray());
And then we can use that bool to determine if we're adding "**" to the beginning and end of every even or odd item in the list.
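The re-wrapping idea above can be sketched as follows. This is only a minimal sketch: `BoldSplitter` and `SplitKeepingBold` are hypothetical names, and it assumes every `**` has a matching partner. It relies on the fact that Regex.Split keeps an empty string at the start of the array when the input begins with the delimiter, so the pieces at odd indices are always the ones that sat between a pair of `**`, and the `firstIsMatch` check may turn out to be unnecessary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoldSplitter
{
    // Hypothetical helper: split on the bare delimiter, then put "**" back
    // around every odd-indexed piece. Assumes the delimiters are balanced.
    public static List<string> SplitKeepingBold(string input)
    {
        string[] pieces = Regex.Split(input, @"\*\*");

        return pieces
            // Odd indices are the pieces that were between a pair of "**".
            .Select((piece, index) => index % 2 == 1 ? "**" + piece + "**" : piece)
            // Drop the empty strings produced when the input starts or ends with "**".
            .Where(piece => piece.Length > 0)
            .ToList();
    }
}
```

For example, `BoldSplitter.SplitKeepingBold("**x** y **z**")` should give the three items `**x**`, ` y ` and `**z**`.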
Questions:
Upvotes: 2
Views: 4630
Reputation: 626691
All you need is to wrap your regex in a capturing group. Once the regex finds a match to split on, the matched text is also pushed into the resulting array. See the Regex.Split reference:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
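To make the quoted behaviour concrete, here is a small sketch of the "plum-pear" case. The class and method names (`CaptureSplitDemo`, `WithoutCapture`, `WithCapture`) are made up for illustration; only Regex.Split itself comes from the documentation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureSplitDemo
{
    // No capturing group: the hyphen is discarded.
    public static string[] WithoutCapture(string input) => Regex.Split(input, "-");

    // Capturing group: the hyphen is returned as its own element.
    public static string[] WithCapture(string input) => Regex.Split(input, "(-)");
}
```

`WithoutCapture("plum-pear")` yields `plum`, `pear`, while `WithCapture("plum-pear")` yields `plum`, `-`, `pear`. When a match sits at the very end of the input, as in `WithCapture("plum-")`, the array ends with an empty string, which is where the empty elements come from.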
The empty elements can be easily filtered out later with LINQ:
using System;
using System.Linq;
using System.Text.RegularExpressions;

var str = "Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck, t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short ribs burgdoggen. **Tail cupim.**";
var res = Regex.Split(str, @"(\*{2}.*?\*{2})", RegexOptions.Singleline) // Split and keep the captures
    .Where(s => !string.IsNullOrWhiteSpace(s)); // Remove empty and whitespace-only elements
Console.WriteLine("\"{0}\"", string.Join("\"\n\"", res));
A small note on the performance of the pattern: if the text is very large, you might experience a slowdown due to the lazy dot matching pattern. It is a good idea to unroll it as @"\*{2}[^*]*(?:\*(?!\*)[^*]*)*\*{2}" (wrapped in a capturing group when used with Regex.Split), especially if there are only a few "wild", standalone asterisks between the delimiters.
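A minimal sketch of the unrolled pattern used with Regex.Split might look like this. `UnrolledSplitter` and its `Split` method are hypothetical names; the pattern is the one from the answer with a capturing group added around it so the matches are kept. Because `[^*]` already matches line breaks, RegexOptions.Singleline should not be needed here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnrolledSplitter
{
    // The unrolled pattern, wrapped in (...) so Regex.Split returns the matches too.
    private static readonly Regex Delimited =
        new Regex(@"(\*{2}[^*]*(?:\*(?!\*)[^*]*)*\*{2})");

    public static List<string> Split(string input) =>
        Delimited.Split(input)
                 // Assumption: only truly empty strings are dropped, so
                 // whitespace-only pieces between two matches survive.
                 .Where(s => s.Length > 0)
                 .ToList();
}
```

For instance, `UnrolledSplitter.Split("a **b*c** d")` should return `a `, `**b*c**` and ` d`: the single asterisk inside the delimited text is consumed by the `\*(?!\*)` branch instead of ending the match.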
Upvotes: 2
Reputation: 18490
How about using Regex.Matches with a pipe (alternation) in your regex, e.g.
(?s)\*\*.*?\*\*|.+?(?=\*\*|$)
The lookahead in the second alternative stops the match right before ** or at the end of the string ($).
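As a rough sketch of how this could be wired up (the `MatchTokenizer` class and `Tokenize` method are hypothetical names, and the pattern is taken verbatim from above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    // First alternative: a whole **...** span.
    // Second alternative: any other text, up to the next ** or the end of input.
    private const string Pattern = @"(?s)\*\*.*?\*\*|.+?(?=\*\*|$)";

    public static List<string> Tokenize(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()          // MatchCollection to IEnumerable<Match>
             .Select(m => m.Value)
             .ToList();
}
```

`MatchTokenizer.Tokenize("a **b** c")` should produce `a `, `**b**` and ` c`. Since every match consumes at least one character, this approach does not yield empty elements, so no filtering step is needed.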
Upvotes: 2
Reputation: 6374
Please try the following:
var s = "Bacon ipsum dolor amet **kevin kielbasa** pork chop picanha chuck, " +
"t-bone **brisket corned beef fatback hamburger cow** sirloin shank prosciutto " +
"shankle. T-bone pancetta ribeye **tongue** fatback drumstick frankfurter short " +
"ribs burgdoggen. **Tail cupim.**";
var split = Regex.Split(s, @"(?=\*\*\S)|(?<=\S\*\*)");
foreach (var part in split)
{
    Console.WriteLine(part);
}
// == OUTPUT ==
//
// Bacon ipsum dolor amet
// **kevin kielbasa**
// pork chop picanha chuck, t-bone
// **brisket corned beef fatback hamburger cow**
// sirloin shank prosciutto shankle. T-bone pancetta ribeye
// **tongue**
// fatback drumstick frankfurter short ribs burgdoggen.
// **Tail cupim.**
Upvotes: 1