Raghunath
Raghunath

Reputation: 614

How to process Text Qualifier delimited file in scala

I have a lot of delimited files with Text Qualifier (every column start and end has double quote). Delimited is not consistent i.e. there can be any delimited like comma(,), Pipe (|), ~, tab (\t).

I need to read this file with text (single column) and then check no of delimiters by considering Text Qualifier. If any record has less or more columns than defined that record should be rejected and loaded to different path.

Below is test data with 3 columns ID, Name and DESC. DESC column has extra delimiter.

"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , ""
4 , "XAA" , "sf,sd
sdfsf"

Last record splitted into two records due new line char in desc field

Below is the code I tried to handle but not able to handle correctly.

val SourceFileDF = spark.read.text(InputFilePath)
SourceFile = SourceFile.filter("value != ''") // Removing empty records while reading
val aCnt = coalesce(length(regexp_replace($"value","[^,]", "")), lit(0)) //to count no of delimiters
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
var invalidrecords= Delimitercount
                    .filter(col("a_cnt")
                    .!==(NoOfDelimiters)).toDF()
val GoodRecordsDF = Delimitercount
                .filter(col("a_cnt")
                .equalTo(NoOfDelimiters)).drop("a_cnt")

With above code I am able to reject all the records which has less or more delimiters but not able to ignore if delimiter is with in text qualifier.

Thanks in Advance.

Upvotes: 1

Views: 399

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626926

You may use a closure with replaceAllIn to remove any chars you want inside a match:

var y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
y = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", ""))
print(y) // => 4 , "XAA" , "sfsdnsdfsf"

See the Scala demo.

Details

  • " - matches a "
  • [^"]* - any 0+ chars other than "
  • (?:""[^"]*)* - matches 0 or more sequences of "" and then 0+ chars other than "
  • " - a ".

The code finds all non-overlapping matches of the above pattern in y and upon finding a match (m) the , and newlines (LF) are removed from the match value (with m.group(0).replaceAll("[,\n]", ""), where m.group(0) is the match value and [,\n] matches either , or a newline).

Upvotes: 1

Related Questions