mvd

Reputation: 2720

Validate csv file with regular expression in Java

File structure is as such:

"group","type","scope","name","attribute","value"
"c","","Probes Count","Counter","value","35"
"b","ProbeInformation","Probes Count","Gauge","value","0"

Always using quotes. There is a trailing newline as well.

Here is what I have:

^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$

That is not matching correctly. I'm using String.matches(regexp);

Upvotes: 0

Views: 3950

Answers (2)

KJP

Reputation: 519

Disclaimer: I didn't even try compiling my code, but this pattern has worked before.

When I can't see at a glance what a regex does, I break it out into lines so it's easier to figure out what's going on. Mismatched parens are more obvious and you can even add comments to it. Also, let's add the Java code around it so escaping oddities become clear.

^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$

becomes

String regex = "^" +
               "(\"[^,\"]*\")" +
               "(," +
                 "(\"[^,\"]*\")" +
               ")*" +
               "(." +
                 "(\"[^,\"]*\")" +
                 "(," +
                    "(\"[^,\"]*\")" +
                 ")" +
               ")*" +
               ".$";

Much better. Now to business: the first thing I see is your regex for the quoted values. It doesn't allow for commas within the strings - which probably isn't what you want - so let's fix that. Let's also put it in its own variable so we don't mis-type it at some point. Lastly, let's add comments so we can verify what the regex is doing.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
String regex = "^" +                           // The beginning of the string
               "(" + QUOTED_VALUE + ")" +      // Capture the first value
               "(," +                          // Start a group, a comma
                 "(" + QUOTED_VALUE + ")" +    // Capture the next value
               ")*" +                          // Close the group.  Allow zero or more of these
               "(." +                          // Start a group, any character
                 "(" + QUOTED_VALUE + ")" +      // Capture another value
                 "(," +                            // Started a nested group, a comma
                    "(" + QUOTED_VALUE + ")" +     // Capture the next value
                 ")" +                             // Close the nested group
               ")*" +                            // Close the group.  Allow zero or more
               ".$";                           // Any character, the end of the input
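To see what dropping the comma from the character class changes, here is a minimal sketch that compares the two quoted-value patterns. The class name, method names and sample strings are made up for illustration; only the two patterns come from the code above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the two patterns come from the discussion above.
final class QuotedValueDemo {
    // Original: no commas and no double quotes allowed between the quotes
    static final Pattern STRICT = Pattern.compile("\"[^,\"]*\"");
    // Revised: only double quotes are disallowed between the quotes
    static final Pattern RELAXED = Pattern.compile("\"[^\"]*\"");

    static boolean strict(String value) {
        return STRICT.matcher(value).matches();
    }

    static boolean relaxed(String value) {
        return RELAXED.matcher(value).matches();
    }

    public static void main(String[] args) {
        String plain = "\"Probes Count\"";
        String withComma = "\"Probes, Count\"";
        System.out.println(strict(plain) + " " + relaxed(plain));         // true true
        System.out.println(strict(withComma) + " " + relaxed(withComma)); // false true
    }
}
```

A value with a comma inside the quotes is rejected by the original character class and accepted by the revised one.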

Things are getting even clearer. I see two big things here:

1) (I think) you're trying to match the newline in your input string. I'll play along, but it's cleaner and easier to split the input on a newline than what you're doing (that's an exercise you can do yourself though). You also need to be mindful of the different newline conventions that different operating systems have (read this).

2) You're capturing too much. Use non-capturing groups, or parsing your output is going to be difficult and error-prone (read this).

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)";  // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or old Mac (\r)
String regex = "^" +                           // The beginning of the string
               "(" + QUOTED_VALUE + ")" +   // Capture the first value
               "(?:," +                       // Start a group, a comma
                 "(" + QUOTED_VALUE + ")" + // Capture the next value
               ")*" +                       // Close the group.  Allow zero or more of these
               "(?:" + NEWLINE +            // Start a group, a newline
                 "(" + QUOTED_VALUE + ")" +   // Capture another value
                 "(?:," +                       // Started a nested group, a comma
                    "(" + QUOTED_VALUE + ")" +  // Capture the next value
                 ")" +                          // Close the nested group
               ")*" +                         // Close the group.  Allow zero or more
               NEWLINE + "$";                 // A trailing newline, the end of the input
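It may be worth double-checking what `.` does with a newline, since that could be part of why the original pattern never matched: in Java, `.` does not match line terminators unless the pattern is compiled with `Pattern.DOTALL`. A small sketch follows; the class and method names are made up, and the NEWLINE spelling is one assumed way to write the alternation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class DotNewlineDemo {
    // Assumed spelling: one newline in Windows, *NIX or old Mac style
    static final String NEWLINE = "(?:\r\n|\n|\r)";

    // Plain "." against the whole input
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // "." with DOTALL, so line terminators are included
    static boolean dotAllMatches(String s) {
        return Pattern.compile(".", Pattern.DOTALL).matcher(s).matches();
    }

    // The explicit newline alternation against the whole input
    static boolean newlineMatches(String s) {
        return Pattern.matches(NEWLINE, s);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("\n"));        // false
        System.out.println(dotAllMatches("\n"));     // true
        System.out.println(newlineMatches("\r\n"));  // true
    }
}
```

Spelling the newline out also handles the two-character `\r\n`, which a single `.` cannot match even with DOTALL.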

From here, I see you duplicating work again. Let's fix that. This also fixes a missing * in your original regex. See if you can find it.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)";  // A newline for (almost) any OS: Windows (\r\n), *NIX (\n) or old Mac (\r)
final String LINE = "(" + QUOTED_VALUE + ")" +   // Capture the first value
                    "(?:," +                       // Start a group, a comma
                      "(" + QUOTED_VALUE + ")" + // Capture the next value
                    ")*";                        // Close the group.  Allow zero or more of these
String regex = "^" +             // The beginning of the string
               LINE +            // Read the first line, capture its values
               "(?:" + NEWLINE + // Start a group for the remaining lines
                 LINE +            // Read more lines, capture their values
               ")*" +            // Close the group.  Allow zero or more
               NEWLINE + "$";    // A trailing newline, the end of the input

That's a little easier to read, no? Now you can test your big nasty regex in pieces if it doesn't work.
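Testing in pieces could look roughly like the sketch below. It rebuilds the same three building blocks and checks each one against the sample from the question. The class name, the `matches` helper, the SAMPLE constant and the exact NEWLINE spelling are my own assumptions for illustration.

```java
// Hypothetical test harness; the constants mirror the ones built up above.
final class CsvRegexPieces {
    static final String QUOTED_VALUE = "\"[^\"]*\"";
    static final String NEWLINE = "(?:\r\n|\n|\r)"; // assumed spelling of the newline alternation
    static final String LINE = "(" + QUOTED_VALUE + ")" + "(?:,(" + QUOTED_VALUE + "))*";
    static final String FILE = "^" + LINE + "(?:" + NEWLINE + LINE + ")*" + NEWLINE + "$";

    // The sample from the question, including the trailing newline
    static final String SAMPLE =
            "\"group\",\"type\",\"scope\",\"name\",\"attribute\",\"value\"\n"
          + "\"c\",\"\",\"Probes Count\",\"Counter\",\"value\",\"35\"\n"
          + "\"b\",\"ProbeInformation\",\"Probes Count\",\"Gauge\",\"value\",\"0\"\n";

    // Same call the question uses: the whole input has to match
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        String firstLine = SAMPLE.substring(0, SAMPLE.indexOf('\n'));
        System.out.println(matches(QUOTED_VALUE, "\"35\"")); // smallest piece first
        System.out.println(matches(LINE, firstLine));        // then one line
        System.out.println(matches(FILE, SAMPLE));           // then the whole file
    }
}
```

If the whole-file check fails but the smaller pieces pass, the problem is in how the lines are glued together, which narrows the search considerably.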

You can now compile the regex, get the matcher, and grab the groups from it. You still have a few issues though:

1) I said earlier that it would be easier to break on newlines. One reason is: how do you determine how many values you have per line? Hard-coding it will work, but it'll break as soon as your input changes. Maybe this isn't a problem for you, but it's still bad practice. Another reason: the regex is still too complex for my liking. You could really get away with stopping at LINE.

2) CSV files allow lines like this:

"some text","123",456,"some more text"

To handle this you might want to add another mini-regex that gets either a quoted value or a list of digits.
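Putting both points together, a line-by-line version might look something like this sketch. Everything here is hypothetical (the class name, method names and the rule that every line must have as many values as the first). It pulls the values out with `find()` because a capturing group inside a repeated group keeps only its last match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, sketched for illustration only.
final class CsvLineValidator {
    // Either a quoted value or a bare run of digits
    static final String VALUE = "(?:\"[^\"]*\"|\\d+)";
    static final Pattern LINE = Pattern.compile(VALUE + "(?:," + VALUE + ")*");
    static final Pattern VALUE_PATTERN = Pattern.compile(VALUE);

    /** Returns the values of one line (quotes stripped), or null if the line is malformed. */
    static List<String> parseLine(String line) {
        if (!LINE.matcher(line).matches()) {
            return null;
        }
        List<String> values = new ArrayList<>();
        Matcher m = VALUE_PATTERN.matcher(line);
        while (m.find()) {
            String v = m.group();
            values.add(v.startsWith("\"") ? v.substring(1, v.length() - 1) : v);
        }
        return values;
    }

    /** True if every line is well formed and has as many values as the first line. */
    static boolean isValid(String csv) {
        // split() drops trailing empty strings, so a trailing newline adds no extra line
        String[] lines = csv.split("\r\n|\n|\r");
        int expected = -1;
        for (String line : lines) {
            List<String> values = parseLine(line);
            if (values == null) {
                return false;
            }
            if (expected < 0) {
                expected = values.size();
            } else if (values.size() != expected) {
                return false;
            }
        }
        return expected > 0;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"some text\",\"123\",456,\"some more text\""));
        System.out.println(isValid("\"a\",\"b\"\n\"c\",42\n"));
    }
}
```

Note that this sketch does not insist on the trailing newline the way the full regex does; if that matters, check `csv.endsWith("\n")` (or the other newline styles) separately.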

Upvotes: 2

Kyle Burton

Reputation: 27588

This question, "CSV Parsing in Java", points to an Apache library for parsing CSV.

If your format is indeed CSV, it is going to be very difficult to parse the data into records with regular expressions.

I know this doesn't answer your question directly, but you will probably have more success with less effort by using a CSV library.

Upvotes: 0
