Reputation: 594
I have written a program which is going to read from csv file using a delimiter and at the same time I have a use case where I needed to create a string from of the delimited data so for that I have created a regex to split the column data using the delimiter.
Now the challenge is when delimiter is present in in double quotes ideally I should not be splitting the data, spark is escaping that delimiter but my regex somehow are not.
private static void readFromSourceFile(SparkSession sparkSession) {
String delType = ",";
final String regex = "["+delType+ "]{"+delType.length()+"}(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
Dataset<Row> csv = sparkSession
.read().option("delimiter",delType)
.option("header",false)
.option("inferSchema",true)
.csv("src/main/resources/quotes2.csv");
char separator= '\u0001';
csv.show(false);
List<Row> df = csv.collectAsList();
String split[] = df.get(0).toString().split(regex);
System.out.println(split.length);
Arrays.stream(split).forEach(System.out::println);
}
The O/P for the program is -
The Red marked area is the string with double quotes, it shouldn't have split that column.
Input file csv file -
New,667.88,In Stock.,Now,true,true,B09D7MQ69X,B09D7MQ69X,NUC10i5FNHN 16GB+512GB,"Intel NUC10 NUC10i5FNHN Home & Business Desktop Mini PC,10th Generation Intel® Core™ i5-10210U, Upto 4.2 GHz, 4 core, 8 Thread, 25W Intel® UHD Graphics, 16GB RAM, 512GB PCIe SSD, Win 10 Pro 8GB RAM + 256GB SSD",false,"【Intel NUC10i5FNHN with RAM & SSD】 Intel NUC10 NUC10i5FNHN Mini PC/HTPC With All New Parts Assembled. Our store is HOT selling Intel NUC11 i5, i7, NUC10 i5, i7, NUC8, Barebone and Mini PC with various sizes of RAM or SSD. If you need to know more, please click on our Store Name:""GEEK + Computer Mall"" --------- ""Products"", OR click ""Visit the GEEK+ Store"" under the title.:BRK:【Quad Core Processor & Graphic 】 10th Generation Intel Core i5-10210U,1.6 GHz – 4.2 GHz Turbo, 4 core, 8 thread, 6MB Cache,25W Intel UHD Graphics, up to 1.0 GHz, 80 EU units.:BRK:【Storage Expansion Options】 Kingston 16GB DDR4 RAM"
Can someone suggest or provide a hint to improve the regex.
Upvotes: 0
Views: 230
Reputation: 545608
I find that splitting by complex delimiters leads to convoluted regular expressions, as your code showcases. In fact, the sub-expression "["+delType+ "]{"+delType.length()+"}"
doesn’t make a lot of sense, and I strongly suspect this is a bug in your code (for instance if delType
is <>
your code would also split on occurrences of ><
).
As an alternative, consider using a regular expression that exhaustively describes the lexical syntax of your input, and then match all tokens. This works particularly well when using named groups.
In your case (CSV with quoted fields, using doubled-up quotes to escape them), the lexical syntax of the tokens can be described by the following token types:
"…"
, where the characters in …
can be anything except "
, but they can also include ""
)As a regular expression, this can be written as follows in Java:
Pattern.compile(
"(?<delim>" + d + ")|" +
"\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
"(?<field>.*?(?:(?=" + d + ")|$))"
);
Where d
is defined as Pattern.quote(delim)
(the quoting is important, in case the delimiter contains a regex special char!).
The only slight complication here is the last token type, because we are matching everything up to the next delimiter (.*?
matches non-greedily), or until the end of the string.
Afterwards, we iterate over all matches and collect those where either the group field
or quotedField
is set. Putting it all together inside a method:
static String[] parseCsvRow(String row, String delim) {
final String d = Pattern.quote(delim);
final Pattern pattern = Pattern.compile(
"(?<delim>" + d + ")|" +
"\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
"(?<field>.*?(?:(?=" + d + ")|$))"
);
final Matcher matcher = pattern.matcher(row);
final List<String> results = new ArrayList<>();
while (matcher.find()) {
if (matcher.group("field") != null) {
results.add(matcher.group("field"));
} else if (matcher.group("quotedField") != null) {
results.add(matcher.group("quotedField").replaceAll("\"\"", "\""));
}
}
return results.toArray(new String[0]);
}
In real code I would wrap this in a CsvParser
class instead of a single method, where the constructor creates the pattern based on the delimiter so that the pattern doesn’t have to be recompiled for each row.
Upvotes: 1