Reputation: 11
I have a text file, and I need to remove some stray delimiters that appear inside quoted fields. The text file looks like this:
string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen Van B'|39";
I want to remove the "|" character inside the quoted strings 'Nguyen Van| A' and 'Nguyen| Van B'.
So I think the best way is to do a Regex replace? Can anyone help me with this regex?
Thanks
Upvotes: 0
Views: 740
Reputation: 17428
You mentioned using the multiline regex is taking too long and asked about the state machine approach. So here is some code using a function to perform the operation (note, the function could probably use a little cleaning, but it shows the idea and works faster than the regex). In my testing, using the regex without multiline, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds. Using the state-machine approach it was about 4 seconds.
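// Requires: using System.Text;
string RemoveInternalPipe(string line)
{
    // Number of single quotes seen so far; an odd count means we are inside a quoted field.
    int count = 0;
    var result = new StringBuilder(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        // Skip pipes that fall inside a quoted field.
        if (c == '|' && count % 2 != 0) continue;
        result.Append(c);
    }
    return result.ToString();
}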
File.WriteAllLines(@"yourOutputFile",
File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));
To compare the performance against the Regex version (without the multiline option), you could run this code:
var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty)));
Upvotes: 0
Reputation: 111820
The regex should be:
(?<=^[^']*'([^']*'[^']*')*[^']*)\|
used with the RegexOptions.Multiline option, so:
var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);
string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
2|'Nguyen Van B'|'Nguyen| Van B'|39";
string replaced = rx.Replace(text, string.Empty);
Example: http://ideone.com/PTdsg5
I strongly advise against using it. To explain why: try to comprehend the regular expression. If you can comprehend it, then you can use it :-)
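One way to take up that challenge is to spread the pattern out and comment each piece. The following is only a sketch: the class name AnnotatedRegexDemo is made up for illustration, and the comments are one reading of the pattern, not an authoritative explanation. It relies on RegexOptions.IgnorePatternWhitespace, which makes the engine ignore unescaped whitespace in the pattern and treat # as the start of a comment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern itself comes from the answer.
class AnnotatedRegexDemo
{
    static void Main()
    {
        var rx = new Regex(@"
            (?<=                     # lookbehind: the text before the pipe must be
                ^ [^']* '            #   start of a line, then text up to the first quote
                ( [^']* ' [^']* ' )* #   then zero or more further pairs of quotes
                [^']*                #   then quote-free text right up to the pipe
            )                        # so an odd number of quotes precedes the pipe
            \|                       # the pipe to remove
            ", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

        // Sample data from the question, written with an explicit newline.
        string text = "1|'Nguyen Van| A'|'Nguyen Van A'|39\n" +
                      "2|'Nguyen Van B'|'Nguyen| Van B'|39";

        // Only the pipes inside the quoted fields are removed.
        Console.WriteLine(rx.Replace(text, string.Empty));
    }
}
```

Read this way, the lookbehind is doing the same job as counting quotes: a pipe is removed only when an odd number of ' characters precedes it. Because the lookbehind is evaluated again for each pipe, the engine may rescan a lot of text, which may be part of why the state-machine approach performs better.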
I would write a simple state machine that counts the ' characters and removes the | whenever the count of ' seen so far is odd.
Upvotes: 1