Parse CSV file in Scala

Question

I am trying to load a CSV file that contains Japanese characters into a DataFrame in Scala. One column value, "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!", is supposed to go into a single column, but when the file is read the string is broken at "」" (which seems to be treated as a new line) and two records are created. I have set the "charset" option to UTF-16 and the quote character is """, but the DataFrame still shows more records than the file contains.

val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")

Any pointers on how to solve this would be very helpful.

curious_me · Accepted Answer

It looks like there is a newline character in your Japanese string. Can you try using the multiLine option when reading the file?

val data = spark.read.format("csv")
 .option("header", "true")
 .option("delimiter", "\t")
 .option("charset", "utf-16")
 .option("inferSchema", "true")
 // let a quoted field that spans several lines stay in one record
 .option("multiLine", true)
 .load(filePath)

Note: As discussed in the answers to the linked question "How to handle multi line rows in spark?", there are some concerns with this approach when the input file is very large.
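
Before turning on multiLine for a large file, it may help to check where the extra record actually comes from. The sketch below is plain Scala (no Spark) and is only an illustration: the object name LineBreakCheck and both helper functions are hypothetical, not part of any library. logicalRecords counts records in already-decoded text while ignoring line breaks inside quoted fields, so a count lower than the physical line count suggests an embedded newline. hasLineBreakByte checks whether a character's UTF-16 encoding contains a raw 0x0A or 0x0D byte. This rests on the assumption, not stated in the answer above, that a reader in non-multiLine mode may split lines on raw bytes before decoding, which would be one possible reason the break lands exactly at "」" (U+300D).

```scala
import java.nio.charset.StandardCharsets

object LineBreakCheck {

  /** Splits decoded text into logical records: a '\n' inside a quoted
    * field does not end the record. Hypothetical helper for diagnosis only. */
  def logicalRecords(text: String, quote: Char = '"'): Vector[String] = {
    val records = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (ch <- text) {
      // An escaped quote ("") toggles twice, so the state stays correct.
      if (ch == quote) inQuotes = !inQuotes
      if (ch == '\n' && !inQuotes) {
        records += current.toString.stripSuffix("\r")
        current.clear()
      } else {
        current.append(ch)
      }
    }
    if (current.nonEmpty) records += current.toString
    records.result()
  }

  /** True if the UTF-16 encoding of `ch` contains a raw LF (0x0A) or
    * CR (0x0D) byte, which a byte-oriented line splitter could mistake
    * for a line ending. */
  def hasLineBreakByte(ch: Char): Boolean =
    ch.toString
      .getBytes(StandardCharsets.UTF_16BE)
      .exists(b => b == 0x0A || b == 0x0D)

  def main(args: Array[String]): Unit = {
    // Made-up sample: a header, one record with a quoted embedded newline,
    // and one plain record.
    val sample = "id\ttitle\n1\t\"first line\nsecond line\"\n2\tplain\n"
    println(s"physical lines:  ${sample.split("\n").length}")
    println(s"logical records: ${logicalRecords(sample).length}")
    println(s"'」' has a CR/LF byte in UTF-16: ${hasLineBreakByte('」')}")
  }
}
```

To run the same check on the real file, the text could be decoded first, for example with new String(Files.readAllBytes(Paths.get("file.txt")), StandardCharsets.UTF_16), and then passed to logicalRecords. If the logical and physical counts match, the decoded text has no embedded line break and the byte-level explanation becomes more plausible.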
