db_brad
db_brad

Reputation: 913

Regex - Fixing CSV - quotes within quoted text qualifier

At this point I have no control over the source system that generates this file.

I have a csv file that uses double quotes as the text qualifier. Inside qualified text fields, I sometimes have double quotes, to signify inches, etc. Such as the following:

something not qualified,"12" x 12" something qualified, becuase it has a comma",this one is not qualified and needs no fixing a 12" x 12"

These should be escaped with 2 sets of quotes like this:

something not qualified,"12"" x 12"" something qualified, becuase it has a comma",this one is not qualified and needs no fixing a 12" x 12"

I'm trying to write some cleanup code using c# and regex. I can write the code to select everything in between ," and ", but I can't figure out how to just get the double quotes inside of those delimiters.

I can have fields without qualifiers (no commas) that can have a single double quote and don't need to be fixed.

Here is what I have in regexr https://regexr.com/3pq51

((?<=,").*(?=",))

Upvotes: 0

Views: 1861

Answers (4)

TheGeneral
TheGeneral

Reputation: 81483

This is a pretty nasty problem and I'm not actually sure its easily solved (or even possible to solve 100% of the time), and borders on magic.

The one good thing is this is only for inches (so you state), and you know how many fields there are.

However, there are seemingly too many Degrees of Freedom and will always give false positives. i.e what you have is just a malformed CSV, with the following fundamental problems:

  • You can't reliably determine where the fields are without the quotes
  • And you can't reliably determine where the quotes are without the fields.

Maybe someone has a heuristic approach, but I'm guessing any reliable solution will need to return an "I'm not sure" result in some situations...

In summary, regex can't solve this for you.

Maybe it would be best attacking this from the other end.

Upvotes: 1

db_brad
db_brad

Reputation: 913

Thank you everyone for your help. It helped me see that I needed to take a phased approach.

First I get everything inside of ," and ",. Then I find the pattern where there are single double quotes in the pattern they appear and replace with 2 double quotes and a space. I do this for each occurrence just in case.

string matchPattern = "((?<=,\").*?(?=\",))";
string input = "something not qualified,\"12\" x 12\" something qualified, becuase it has a comma\",this one is not qualified and needs no fixing a 12\" x 12\",\"8\" X 8\" sign, plain\",one more";
var newLine = input;

Regex regx = new Regex(matchPattern);
Regex regxReplace = new Regex(@"(?<=\w)""[^\w|\""]");
var matches = regx.Matches(input);

foreach (Match matchingString in matches)
{        

    var value = matchingString.Value;
    if (regxReplace.IsMatch(value))
    {
        changed = true;
        var newReplacementString = regxReplace.Replace(value, "\"\" ");
        newLine = newLine.Replace(matchingString.Value, newReplacementString);
    }
}

return newLine;

Upvotes: 1

revo
revo

Reputation: 48711

If delimiting strings ," and ", are never going to happen inside a quoted string you could use an infinite lookbehind or \G to look for this boundary. Since \G-approach is more suitable and faster in finding a match or failure, I'll go with it:

((?:\A|,)"|\G(?!\A))([^"]*)(")(?!,|\Z)

Regex live demo

C# code (see live demo here):

str = Regex.Replace(str, @"((?:\A|,)""|\G(?!\A))([^""]*)("")(?!,|\Z)", @"$1$2$3$3");

Regex breakdown:

  • ( Start of capturing group one
    • (?:\A|,)" Match start of input string or comma following "
    • | Or
    • \G(?!\A) Match where previous match ended
  • ) End of capturing group one
  • ([^"]*) Capture every thing but "
  • (") Capture "
  • (?!,|\Z) Shouldn't follow a comma or end of input string

Upvotes: 1

wp78de
wp78de

Reputation: 18950

Your input string is a mess and it's probably impossible to come up with a watertight solution. One thing you could try is to capture everything between ," ... ", like that:

[^,]+|,"(.*?)",

This obviously breaks if there is another ", in your qualified string.

Demo

Sample Code:

using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {

        string pattern = @"[^,]+|,""(.*?)"",";
        string input = @"something not qualified,""12"" x 12"" something qualified, becuase it has a comma"",this one is not qualified and needs no fixing a 12"" x 12""";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            if(m.Groups[1].Success)
                Console.WriteLine("'{0}'", m.Groups[1].Value);
            else
                Console.WriteLine("'{0}'", m.Value);
        }
    }
}

Output:

something not qualified
12" x 12" something qualified, becuase it has a comma
this one is not qualified and needs no fixing a 12" x 12"

Upvotes: 1

Related Questions