user948620

Java : Regular Expression escape Regular Expression

This sample data is returned by a web service:

200,6, "California, USA"

I want to split it using split(",") and checked the result with this simple code:

String loc = "200,6,\"California, USA\"";       
String[] s = loc.split(",");

for(String f : s)
   System.out.println(f);

Unfortunately, this is the result:

200
6
"California
 USA"

The expected result is:

200
6
"California, USA"

I tried different regular expressions with no luck. Is it possible to make the split ignore commas that appear inside the double quotes?

UPDATE 1: Added C# Code

UPDATE 2: Removed C# Code

Upvotes: 7

Views: 249

Answers (4)

Praveen Kumar Patidar

Reputation: 77

Try this expression:

public class Test {

    public static void main(String[] args) {
        String loc = "200,6,\"Paris, France\"";
        // Split only on commas that are followed by an even number of quotes,
        // i.e. commas that are not inside a quoted value.
        String[] str1 = loc.split(",(?=(?:[^\"]|\"[^\"]*\")*$)");

        for (String tmp : str1) {
            System.out.println(tmp);
        }
    }
}

Upvotes: 0

JohnnyO

Reputation: 3068

An easier solution might be to use an existing library, such as OpenCSV, to parse your data. With this library it takes two lines:

CSVParser parser = new CSVParser();
String[] data = parser.parseLine(inputLine);

This becomes especially important if more complex CSV values come back in the future (multiline values, values with escaped quotes inside an element, etc.). If you don't want to add the dependency, you could always use their code as a reference (though it is not based on regular expressions).

Upvotes: 2

Patashu

Reputation: 21773

If there's a good lexer/parser library for Java, you could define a lexer like the following pseudo-lexer code:

Delimiter: ,
Item: ([^,"]+) | ("[^"]*")
Data: Item Delimiter Data | Item 

A lexer starts at the top-level token definition (in this case Data) and tries to form tokens out of the string until it either cannot continue or has consumed the whole string. For your string, the following would happen:

  • I want to make Data out of 200,6, "California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - 200 is an Item and then , is a Delimiter, so I can tokenize that and keep going.
  • I want to make Data out of 6, "California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - 6 is an Item and then , is a Delimiter, so I can tokenize that and keep going.
  • I want to make Data out of "California, USA".
  • I can make Data out of an Item, a Delimiter and Data.
  • I looked - "California, USA" is an Item, but I see no Delimiter after it, so let's try something else.
  • I can make Data out of an Item.
  • I looked - "California, USA" is an Item, so I can tokenize that and keep going.
  • The string is empty. I'm done. Here are your tokens.
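
The walk-through above can be sketched in plain Java without a lexer library. This is only a minimal illustration, not a full CSV parser: the class name TinyCsvTokenizer and the method tokenize are made up for this example, a quoted Item is assumed to be any run of non-quote characters between two quotes (so the comma in "California, USA" is allowed), and the input is the string from the question's code, with no space after the delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCsvTokenizer {
    // Item: either a quoted value, or a run of characters with no comma and no quote.
    private static final Pattern ITEM = Pattern.compile("\"[^\"]*\"|[^,\"]+");

    // Data: Item Delimiter Data | Item
    static List<String> tokenize(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        int pos = 0;
        while (true) {
            // Try to read an Item starting exactly at the current position.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Expected an Item at index " + pos);
            }
            items.add(m.group());
            pos = m.end();
            if (pos == input.length()) {
                return items; // the string is used up: done
            }
            // Otherwise a Delimiter must follow before the next Item.
            if (input.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++;
        }
    }

    public static void main(String[] args) {
        for (String item : tokenize("200,6,\"California, USA\"")) {
            System.out.println(item);
        }
    }
}
```

Because the grammar requires every Item to be non-empty (or quoted), this sketch rejects empty fields and trailing commas instead of returning empty strings for them.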

(I learned about how lexers work from the guide to PLY, a Python lexer/parser: http://www.dabeaz.com/ply/ply.html )

Upvotes: 0

Aniket Lawande

Reputation: 131

,(?=(?:[^"]|"[^"]*")*$)

This is the regex you want. (To use it in the split function, you'll need to escape the quotes inside the Java string literal.)

Explanation

You need to find every ',' that is not inside quotes. For that you need lookahead (http://www.regular-expressions.info/lookaround.html) to check whether the comma currently being matched is inside or outside quotes.

To do that, we use lookahead to ensure that the ',' being matched is followed by an EVEN number of '"' characters up to the end of the string (meaning that it lies outside quotes).

So (?:[^"]|"[^"]*")*$ matches only when everything up to the end of the string is made of non-quote characters and complete pairs of quotes with anything in between them, i.e. no quote is left unpaired.

(?=(?:[^"]|"[^"]*")*$) looks ahead for the above match without consuming it.

,(?=(?:[^"]|"[^"]*")*$) finally matches every ',' for which that lookahead succeeds.
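
To make the "even number of quotes" idea concrete, here is a small sketch that does the same check by hand, without a regex: a comma is treated as a separator only if the number of '"' characters after it is even. The class name QuoteParityDemo and its methods are invented for this illustration, and recounting the quotes for every comma is deliberately naive (quadratic), chosen only to mirror what the lookahead tests.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteParityDemo {
    // Counts the '"' characters in s from index `from` to the end.
    static int quotesAfter(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    // Splits on every comma that is followed by an even number of quotes.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ',' && quotesAfter(s, i + 1) % 2 == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(s.substring(start)); // last field
        return parts;
    }

    public static void main(String[] args) {
        // Prints:
        // 200
        // 6
        // "California, USA"
        for (String part : split("200,6,\"California, USA\"")) {
            System.out.println(part);
        }
    }
}
```

One difference from String.split: this sketch keeps trailing empty fields, whereas split(regex) drops trailing empty strings by default.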

Upvotes: 3
