Shaik Allabakash
Shaik Allabakash

Reputation: 35

Parsing CSV in Java which applies text qualifier only when the content has comma

I've a CSV file whose content is as follows:

1,"hello, there",I have a csv in which,"only when ""double quote"" or comma are there in the content",it will be wrapped in the double quotes,otherwise not,something like 1/2" will not be wrapped up in double quotes.

I used OpenCSV and other CSV libraries for parsing but it didn't work.

I used the Regular Expression quoted in the StackOverflow question but it also didn't work.

However, when I open it in Excel it is working fine. Can someone give me a hint regarding how to parse this CSV file.

Note that when the content contains a comma, then only it is wrapped up in the text qualifier. When such content is wrapped up in double quotes, and the double quote is part of the content, then it is escaped with the double quote. In other words, it changes to double double quote. But if the content has a double quote, then it is not wrapped up in the text qualifiers.

Please advise regarding this.

The output of the above content when parsed should be as below:

The output should be as follows:

1
hello, there
I have a csv in which
only whn "double quote" or comma are there in the content
it will be wrapped in the double quotes
otherwise not
something like 1/2" will not be wrapped up in double quotes.

I tried using open csv and also tried split using the regular expression:

",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

But of no use.

My Data is like below:

PRODUCT,,1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL,

Want this to be parsed as below (I used to represent an empty cell when we see it in excel)

PRODUCT
<BLANK>
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE
P
2510906459
<BLANK>
DEWALT TOOLS
<BLANK>
<BLANK>
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL

Upvotes: 1

Views: 1163

Answers (2)

Jeronimo Backes
Jeronimo Backes

Reputation: 6289

I had no problems parsing your input with uniVocity-parsers:

    String input = "PRODUCT,,1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src=\"http://example.com/image.png\"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL,";
    Reader reader = new StringReader(input);

    CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial.
    settings.setNullValue("<BLANK>"); //use that to obtain <BLANK> to represent nulls

    String[] row = new CsvParser(settings).parseAll(reader).get(0);
    for(String element : row){
        System.out.println(element);
    }

Output:

PRODUCT
<BLANK>
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE
P
2510906459
<BLANK>
DEWALT TOOLS
<BLANK>
<BLANK>
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED  - EQUAL TO 115-DCD960KL
<BLANK>

Disclaimer: I'm the author of this library, it's open-source and free (Apache 2.0 license)

Upvotes: 2

Saleem
Saleem

Reputation: 8978

Try following regex:

Stream<String> lines = Files.lines(Paths.get("path to csv file"));

Pattern regex = Pattern.compile("\"(.*?)\"(?=,|$)|(?<=(?:,|^))(.*?)(?=,|$)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

lines.forEach( line -> {
    Matcher matcher = regex.matcher(line);
    while (matcher.find()) {
        String content = matcher.group(1) == null ? matcher.group() : matcher.group(1);
        System.out.println(content);
    }
});

Based on sample input text

1,"hello, there",I have a csv in which,
"only when ""double quote"" or comma are there in the content",
it will be wrapped in the double quotes,otherwise not,
something like 1/2" will not be wrapped up in double quotes.

It will emit.

1
hello, there
I have a csv in which
only when ""double quote"" or comma are there in the content
it will be wrapped in the double quotes
otherwise not
something like 1/2" will not be wrapped up in double quotes.

Upvotes: 1

Related Questions