Cyberlife

Reputation: 23

Replace \r\n in a csv column using Regex

My question is as follows:

I have a CSV file in which column data is enclosed in double quotes (").
The end-of-line sequence is \r\n (hex value: 0D0A).

My problem: a column can contain \r\n (it is generally a written sentence with carriage returns).
When I load the CSV in Excel, this \r\n is interpreted as a new line.

My proposed solution is to use a regex to apply the following logic:

Is there a solution using regex in C#?

Thanks a lot for your help

Upvotes: 2

Views: 1433

Answers (2)

Arie

Reputation: 5373

I can't think of an easy way to find only the problematic entries (which doesn't mean there isn't one), but there is an easy way to find all entries enclosed in double quotes. You can then change the ones that are problematic:

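var result = Regex.Replace(input, @"""(?:(\r\n)|[^""])+""", delegate(Match m)
{
    // Group 1 is set only if the quoted entry contains at least one \r\n.
    if (string.IsNullOrEmpty(m.Groups[1].Value))
        return m.Value;
    // Replace every line break inside the quoted entry with a space.
    return m.Value.Replace("\r\n", " ");
});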

Possible problems:

If your file has escaped double quote characters, this may not work. You'll have to test it.
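As a hedged illustration of that caveat, here is a minimal sketch of a variation on the regex above. It assumes that a double quote inside a field is escaped by doubling it (""), and that every field containing \r\n is quoted. The class name QuotedFieldCleaner and the replacement parameter are hypothetical names chosen for this example, not part of the answer's code.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotedFieldCleaner
{
    // A quoted field: opening quote, then any run of doubled quotes ("")
    // or non-quote characters, then the closing quote.
    // Assumption: quotes inside a field are escaped as "".
    static readonly Regex QuotedField = new Regex("\"(?:\"\"|[^\"])*\"");

    // Replaces \r\n only inside quoted fields; row-ending \r\n is left alone.
    public static string Clean(string csv, string replacement = " ") =>
        QuotedField.Replace(csv, m => m.Value.Replace("\r\n", replacement));

    static void Main()
    {
        string csv = "\"1\",\"He said \"\"hi\"\"\r\nand left\",\"\"\r\n"
                   + "\"2\",\"plain\",\"y\"\r\n";
        // Prints two rows: the line break inside the second field of row 1
        // becomes a space, the doubled quotes and the empty field are kept.
        Console.Write(Clean(csv));
    }
}
```

Unlike the original pattern, this one uses * instead of +, so an empty quoted field ("") is matched as a field of its own instead of shifting the alignment of later matches.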

Upvotes: 1

nhahtdh

Reputation: 56809

I assume that the CSV file is valid in the first place, and that the double quote character (") does not appear inside any field.

The first assumption is important because it reduces the work here. If it doesn't hold, you are better off using a CSV parser to parse the file and then doing the replacement on each column.

The second assumption can be dropped with a small modification, if you know how the double quote is escaped in your CSV. If " is escaped as "", then change [^\"] in the code below to (?:[^\"]|\"\"). If " is escaped as \", then change [^\"] in the code below to (?:[^\\\\\"]|\\\\.).
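To make those substitutions concrete, here is a small sketch that plugs each sub-pattern into the row-matching regex used in this answer. RowEndPatterns and Build are hypothetical names introduced only for this example; the sub-patterns themselves are the ones given in the paragraph above.

```csharp
using System;
using System.Text.RegularExpressions;

static class RowEndPatterns
{
    // Sub-patterns for one character inside a quoted field.
    public const string Plain = "[^\"]";                   // no quotes inside fields
    public const string Doubled = "(?:[^\"]|\"\")";        // " escaped as ""
    public const string Backslash = "(?:[^\\\\\"]|\\\\.)"; // " escaped as \"

    // Hypothetical helper: inserts the sub-pattern into the answer's regex.
    public static Regex Build(string fieldChar) =>
        new Regex("^\"?" + fieldChar + "*\"(,\"" + fieldChar + "*\")*$");

    static void Main()
    {
        string doubled = "\"He said \"\"hi\"\"\",\"x\""; // "He said ""hi""","x"
        string slashed = "\"say \\\"hi\\\"\",\"x\"";     // "say \"hi\"","x"

        Console.WriteLine(Build(Plain).IsMatch(doubled));     // False
        Console.WriteLine(Build(Doubled).IsMatch(doubled));   // True
        Console.WriteLine(Build(Backslash).IsMatch(slashed)); // True
    }
}
```

The first result shows why the modification matters: with the unmodified pattern, a complete row that contains an escaped quote is not recognised as a row end, so it would be wrongly joined to the next line.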

Read the file line by line and process each line as follows:

if (Regex.IsMatch(line, "^\"?[^\"]*\"(,\"[^\"]*\")*$")) {
    // Write the current line to file with line ending
} else {
    // Write the current line to file without line ending
}

To make it easier to understand, let me remove the string literal escaping:

^"?[^"]*"(,"[^"]*")*$

For a valid CSV, the regex above matches the whole line if the row has no column containing a line break, or the last line of the row if some of its columns do contain line breaks. I use it to decide after which lines a line ending must not be written.
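The two comment-only branches above could be filled in roughly as follows. This is a sketch under the same assumptions as the answer (valid CSV, every field quoted, no double quote inside a field). CsvRowJoiner, Join and the separator parameter are hypothetical names for this example; the separator defaults to an empty string to mirror "write without line ending", and passing " " keeps the words on either side of the removed line break apart.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class CsvRowJoiner
{
    // The pattern from the answer: matches a line that ends a CSV row.
    static readonly Regex RowEnd = new Regex("^\"?[^\"]*\"(,\"[^\"]*\")*$");

    // Copies input to output, joining the physical lines of each row.
    public static void Join(TextReader input, TextWriter output, string separator = "")
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (RowEnd.IsMatch(line))
                output.Write(line + "\r\n");    // row is complete: keep the line ending
            else
                output.Write(line + separator); // row continues on the next line
        }
    }

    static void Main()
    {
        var input = new StringReader(
            "\"1\",\"first line\r\nsecond line\",\"x\"\r\n\"2\",\"plain\",\"y\"\r\n");
        var output = new StringWriter();
        Join(input, output, " ");
        // Prints two rows; row 1 now reads "1","first line second line","x"
        Console.Write(output.ToString());
    }
}
```

For real files, a StreamReader and StreamWriter can be passed in place of the StringReader and StringWriter, since both derive from TextReader and TextWriter.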

Upvotes: 0
