RegEx : Double quotes unable to escape delimiter

Question

I have written a program which is going to read from csv file using a delimiter and at the same time I have a use case where I needed to create a string from of the delimited data so for that I have created a regex to split the column data using the delimiter.

Now the challenge is when delimiter is present in in double quotes ideally I should not be splitting the data, spark is escaping that delimiter but my regex somehow are not.

    private static void readFromSourceFile(SparkSession sparkSession) {
    String delType = ",";
    final String regex = "["+delType+ "]{"+delType.length()+"}(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    Dataset csv = sparkSession
            .read().option("delimiter",delType)
            .option("header",false)
            .option("inferSchema",true)
            .csv("src/main/resources/quotes2.csv");

    char separator= '\u0001';
    
    csv.show(false);
    List df = csv.collectAsList();
    String split[] =  df.get(0).toString().split(regex);
    System.out.println(split.length);
    Arrays.stream(split).forEach(System.out::println);

}

The O/P for the program is -

The Red marked area is the string with double quotes, it shouldn't have split that column.

Input file csv file -

New,667.88,In Stock.,Now,true,true,B09D7MQ69X,B09D7MQ69X,NUC10i5FNHN 16GB+512GB,"Intel NUC10 NUC10i5FNHN Home & Business Desktop Mini PC,10th Generation Intel® Core™ i5-10210U, Upto 4.2 GHz, 4 core, 8 Thread, 25W Intel® UHD Graphics, 16GB RAM, 512GB PCIe SSD, Win 10 Pro 8GB RAM + 256GB SSD",false,"【Intel NUC10i5FNHN with RAM & SSD】 Intel NUC10 NUC10i5FNHN Mini PC/HTPC With All New Parts Assembled. Our store is HOT selling Intel NUC11 i5, i7, NUC10 i5, i7, NUC8, Barebone and Mini PC with various sizes of RAM or SSD. If you need to know more, please click on our Store Name:""GEEK + Computer Mall"" --------- ""Products"", OR click ""Visit the GEEK+ Store"" under the title.:BRK:【Quad Core Processor & Graphic 】 10th Generation Intel Core i5-10210U，1.6 GHz – 4.2 GHz Turbo, 4 core, 8 thread, 6MB Cache，25W Intel UHD Graphics, up to 1.0 GHz, 80 EU units.:BRK:【Storage Expansion Options】 Kingston 16GB DDR4 RAM"

Can someone suggest or provide a hint to improve the regex.

RegEx : Double quotes unable to escape delimiter

Answers (1)

Related Questions