Reputation: 1194
I have a text file that I'm reading with a C# program, and I need to split its contents. I decided to use Regex.Split().
The pattern I am trying to match is \n( )+Copyright
Here's an example of the text:
\r\n\r\nLANGUAGE: ENGLISH\r\n\r\nDISTRIBUTION: Every Zone\r\n\r\nPUBLICATION-TYPE: Newspaper\r\n\r\n\r\n Copyright 2014 Washingtonpost.Newsweek Interactive Company, LLC d/b/a\r\n Washington Post Digital\r\n All Rights Reserved\r\n"
The reason for including the newline is that I also have instances where the word "Copyright" shows up inside a paragraph:
\r\n\r\nFrom Blood Aces by Doug Swanson, to be published by Viking, a member of Penguin\r\nGroup (USA) LLC on Aug. 14, 2014. Copyright © 2014 by Doug J. Swanson.\r\n
However, the problem I have is that when I perform this call:
var splitContent= Regex.Split(filecontent, @"\n( )+Copyright");
I get over 2x as many items in splitContent as there should be. I've tried modifying the regex pattern to @"(\n){1}?( )+Copyright" and a few other similar patterns, but then I get 4-5x the number of items in splitContent that I should be getting.
Is this the correct way to write this kind of regex?
Any help would be greatly appreciated.
Upvotes: 0
Views: 152
Reputation: 74197
Why try to reinvent the wheel? Just change your regular expression to use the correct options:
RegexOptions options = RegexOptions.Multiline
                     | RegexOptions.IgnoreCase;
// Verbatim string (@"...") so that \s reaches the regex engine as written.
Regex rxCopyright = new Regex(@"^\s*Copyright", options);
string[] lines = rxCopyright.Split(yourStringHere);
RegexOptions.Multiline tells the regular expression engine to:
Use multiline mode, where ^ and $ match the beginning and end of each line (instead of the beginning and end of the input string). For more information, see Multiline Mode.
So your text will be split into blocks wherever the word "copyright" appears at the beginning of a line (with or without leading whitespace).
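As a rough illustration of the multiline approach above, here is a minimal sketch. The class and method names (MultilineSplitDemo, SplitOnCopyrightLines) and the sample text are made up for this example; only the pattern and options come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineSplitDemo
{
    // Hypothetical helper: splits wherever "Copyright" starts a line,
    // with or without leading whitespace. Mid-line occurrences are ignored
    // because ^ only matches at the start of the input or right after \n.
    public static string[] SplitOnCopyrightLines(string text)
    {
        var rx = new Regex(@"^\s*Copyright",
                           RegexOptions.Multiline | RegexOptions.IgnoreCase);
        return rx.Split(text);
    }

    public static void Main()
    {
        // Invented sample: one line-leading "Copyright" and one mid-paragraph.
        string sample =
            "PUBLICATION-TYPE: Newspaper\r\n\r\n Copyright 2014 Example Co.\r\n" +
            "Group (USA) LLC on Aug. 14, 2014. Copyright 2014 by Doug J. Swanson.\r\n";

        string[] parts = SplitOnCopyrightLines(sample);
        Console.WriteLine(parts.Length); // 2: only the line-leading match splits
    }
}
```

Note that \s also matches \r and \n, so a match may begin on an empty line just above the "Copyright" line and consume that line break as well.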
And if you want to use parentheses for clarity, add RegexOptions.ExplicitCapture to the mix. It:
Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
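To make the effect of ExplicitCapture concrete, here is a small sketch that keeps the parentheses from the question's pattern but passes RegexOptions.ExplicitCapture. The names ExplicitCaptureDemo and SplitExplicit and the sample string are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    // Hypothetical helper: same pattern as in the question, but with
    // ExplicitCapture the unnamed ( ) group no longer captures, so
    // Regex.Split does not add the captured space to the result.
    public static string[] SplitExplicit(string text)
    {
        return Regex.Split(text, @"\n( )+Copyright", RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        // Invented sample with a single line-leading "Copyright".
        string sample = "header\r\n  Copyright 2014 Example Co.\r\nbody";
        Console.WriteLine(SplitExplicit(sample).Length); // 2
    }
}
```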
Upvotes: 0
Reputation: 22102
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string input = "plum-pear";
        string pattern = "(-)";
        string[] substrings = Regex.Split(input, pattern); // Split on hyphens
        foreach (string match in substrings)
        {
            Console.WriteLine("'{0}'", match);
        }
    }
}
// The example displays the following output:
//    'plum'
//    '-'
//    'pear'
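Applied to the pattern from the question, this behavior would explain the extra items: the ( )+ group captures a space, and that space is returned as its own array element for every match. The sketch below compares the original pattern with a non-capturing variant; the class name, method names, and sample text are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureSplitDemo
{
    // Pattern from the question: the ( ) group is capturing.
    public static string[] SplitCapturing(string text)
    {
        return Regex.Split(text, @"\n( )+Copyright");
    }

    // Same pattern with a non-capturing group (?: ).
    public static string[] SplitNonCapturing(string text)
    {
        return Regex.Split(text, @"\n(?: )+Copyright");
    }

    public static void Main()
    {
        // Invented sample with a single line-leading "Copyright".
        string sample = "LANGUAGE: ENGLISH\r\n   Copyright 2014 Example\r\n All Rights Reserved\r\n";

        Console.WriteLine(SplitCapturing(sample).Length);    // 3: includes the captured " "
        Console.WriteLine(SplitNonCapturing(sample).Length); // 2
    }
}
```

Since the group here only repeats a single space, writing the pattern as @"\n +Copyright" without any parentheses should behave the same as the non-capturing version.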
Upvotes: 1