Reputation: 1717
I have been given the following CSV file:
"A";"B
C";"D"
"E";"F"
"G
H"
I need to get rid of the newline characters that appear inside the text, except the ones placed directly between two delimiters (double quotes in this case). In other words, the only newline characters that should remain in the file are those that are the sole character between two double quotes ("\n").
The idea is to have a regex that finds every newline character except the ones directly between two double quotes (that is, the ones at the end of a line, as each line always starts and ends with a double quote) and replaces it with a space.
So the result of processing the file above should be:
"A";"B C";"D"
"E";"F"
"G H"
In other words, the regex should find every \n except the ones in "\n".
I tried the regex [^"\n"][\n] to match every \n except "\n". When I test it in Sublime Text 2, it selects the wanted newline character, but also the single character before it.
This means that when I replace the matches with a space, the B and G characters are replaced as well, which is not expected.
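The extra character is selected because the character class [^"\n] consumes one character as part of the match. A minimal sketch of the same idea with a lookbehind, which checks the previous character without including it in the match, might look like this. The class and method names are hypothetical, and the sketch assumes the file uses \n line endings only:

```csharp
using System.Text.RegularExpressions;

public static class LookbehindSketch
{
    // Hypothetical helper, not part of the original post.
    // Assumes line breaks are plain \n (no \r\n).
    public static string JoinBrokenLines(string data)
    {
        // (?<=[^"\n]) asserts that the previous character is neither a
        // double quote nor a newline, without making it part of the match,
        // so only the \n itself is replaced.
        return Regex.Replace(data, "(?<=[^\"\\n])\\n", " ");
    }
}
```

Calling LookbehindSketch.JoinBrokenLines on the sample file content should leave B and G in place and replace only the newline after them.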
I should also mention that I will use this regex to perform the replacement in C# code.
Do you have any idea how to make this work the way I want?
EDIT 2016-07-14:
I tried what OmegaMan proposed below. It works fine for the case I specified above, but I had not noticed that a quoted field may span more than two lines in the file. Example:
Given this CSV file (referred to as "pathToTheExampleFile" later in the C# code):
"A";"B
C";"D"
"E";"F";"Part1
Part2
Part3
";
Using OmegaMan's solution in the code, I get this result:
"A";"B C";"D"
"E";"F";"Part1 Part2
Part3
";
Whereas it is expected to be:
"A";"B C";"D"
"E";"F";"Part1 Part2 Part3 ";
However, after executing OmegaMan's Replace operation three times, like this:
string data = File.ReadAllText(@"pathToTheExampleFile");
string pattern = @"(?<=\x22[^\x22\r\n]+)([\r\n]+)";
var result = Regex.Replace(data, pattern, " ");
result = Regex.Replace(result, pattern, " ");
result = Regex.Replace(result, pattern, " ");
Console.WriteLine(result);
I get exactly the result I want. So what is needed is a match that works across multiple lines. I don't see any other cases in which the regex would fail.
If you have any idea how to modify the regex, it would be appreciated.
EDIT 2016-07-15:
I should add that it works with this ugly workaround:
while (Regex.IsMatch(data, pattern))
{
data = Regex.Replace(data, pattern, " ");
}
In the end, data contains the expected string. It's very ugly, but I'm sure it's doable somehow with the regex alone.
Upvotes: 2
Views: 773
Reputation: 1
Try
string pattern = @"([^\x22])(\r\n)+|(;\x22)\r\n";
string result = Regex.Replace(data, pattern, "$1$3 ");
I got
"A";"B C";"D"
"E";"F";"Part1 Part2 Part3 "
"G G2 G3";"H";" I I2 I3 "
For this input:
"A";"B
C";"D"
"E";"F";"Part1
Part2
Part3
"
"G
G2
G3";"H";"
I
I2
I3
"
See https://dotnetfiddle.net/uc538C
Upvotes: 0
Reputation: 31616
A non-consuming lookbehind can verify that a quoted text is still open; the pattern below then matches the \r\n that follows and replaces it with a space:
string data = "\"A\";\"B\r\nC\";\"D\"\r\n\"E\";\"F\"\r\n\"G\r\nH\"";
string pattern = @"(?<=\x22[^\x22\r\n]+)([\r\n]+)";
string result = Regex.Replace(data, pattern, " ");
Note that \x22 is the escape for the " character.
Replace returns this:
"A";"B C";"D"
"E";"F"
"G H"
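When a quoted field spans more than two lines, as in the question's later edit, the lookbehind above reaches only the first break of that field in a single pass. One possible single-pass alternative, sketched here under the assumption that double quotes are never escaped inside a field, is to match each quoted field as a whole and flatten the line breaks inside it with a match evaluator. The class and method names are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class QuotedFieldSketch
{
    // Hypothetical helper, not part of the original answer.
    // Assumes a field never contains an escaped double quote.
    public static string FlattenQuotedFields(string data)
    {
        // "[^"]*" matches one whole quoted field, line breaks included.
        // Inside each field, every run of \r and \n becomes one space.
        // Line breaks between fields are outside the matches and stay.
        return Regex.Replace(
            data,
            "\"[^\"]*\"",
            m => Regex.Replace(m.Value, "[\r\n]+", " "));
    }
}
```

Because each field is handled in one evaluator call, no repeated Replace passes or while loop should be needed for fields that span several lines.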
Upvotes: 1
Reputation: 186698
I suggest a simple loop, which is easy to implement, instead of a complex regular expression:
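private static String trimNewLines(String value) {
  if (null == value)
    return value;
  StringBuilder sb = new StringBuilder(value.Length);
  Boolean inQuotation = false;
  foreach (char ch in value) {
    if (ch == '"')
      inQuotation = !inQuotation;
    // Inside a quoted field, drop '\r' and turn '\n' into a space;
    // everything else, including line breaks between records, is kept.
    if (inQuotation && ch == '\r')
      continue;
    sb.Append(inQuotation && ch == '\n' ? ' ' : ch);
  }
  return sb.ToString();
}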
...
String result = trimNewLines(File.ReadAllText(@"c:\MyData.csv"));
Upvotes: 0
Reputation: 785156
You can use a lookahead-based regex to search for:
\n(?!")
and replace it with:
" "
\n(?!") matches any \n that is not followed by a double quote.
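As a minimal C# sketch of this lookahead, assuming the file uses \n line endings only (the class and method names are hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class LookaheadSketch
{
    // Hypothetical helper, not part of the original answer.
    // Assumes line breaks are plain \n (no \r\n).
    public static string ReplaceBreaksNotBeforeQuote(string data)
    {
        // \n(?!") matches a newline only when the next character
        // is not a double quote; the quote itself is never consumed.
        return Regex.Replace(data, "\\n(?!\")", " ");
    }
}
```

Note that a line break inside a field that is immediately followed by the closing quote (as in the Part3 example from the question) is not matched by this pattern, and with \r\n line endings the \r would be left in place.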
Upvotes: 1