Reputation: 614
I have a lot of delimited files with Text Qualifier (every column start and end has double quote). Delimited is not consistent i.e. there can be any delimited like comma(,), Pipe (|), ~, tab (\t).
I need to read this file with text (single column) and then check no of delimiters by considering Text Qualifier. If any record has less or more columns than defined that record should be rejected and loaded to different path.
Below is test data with 3 columns ID, Name and DESC. DESC column has extra delimiter.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , ""
4 , "XAA" , "sf,sd
sdfsf"
Last record splitted into two records due new line char in desc field
Below is the code I tried to handle but not able to handle correctly.
val SourceFileDF = spark.read.text(InputFilePath)
SourceFile = SourceFile.filter("value != ''") // Removing empty records while reading
val aCnt = coalesce(length(regexp_replace($"value","[^,]", "")), lit(0)) //to count no of delimiters
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
var invalidrecords= Delimitercount
.filter(col("a_cnt")
.!==(NoOfDelimiters)).toDF()
val GoodRecordsDF = Delimitercount
.filter(col("a_cnt")
.equalTo(NoOfDelimiters)).drop("a_cnt")
With above code I am able to reject all the records which has less or more delimiters but not able to ignore if delimiter is with in text qualifier.
Thanks in Advance.
Upvotes: 1
Views: 399
Reputation: 626926
You may use a closure with replaceAllIn
to remove any chars you want inside a match:
var y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
y = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", ""))
print(y) // => 4 , "XAA" , "sfsdnsdfsf"
See the Scala demo.
Details
"
- matches a "
[^"]*
- any 0+ chars other than "
(?:""[^"]*)*
- matches 0 or more sequences of ""
and then 0+ chars other than "
"
- a "
.The code finds all non-overlapping matches of the above pattern in y
and upon finding a match (m
) the ,
and newlines (LF) are removed from the match value (with m.group(0).replaceAll("[,\n]", "")
, where m.group(0)
is the match value and [,\n]
matches either ,
or a newline).
Upvotes: 1