sky

Reputation: 23

Parse CSV file in Scala

I am trying to load a CSV file that contains Japanese characters into a DataFrame in Scala. A column value such as "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!" should go into a single column, but Spark breaks the string at "」" (treating it as a newline) and creates two records. I have set the "charset" option to UTF-16 and the quote character is "\"", but the DataFrame still contains more records than the file.

val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")

Any pointer on how to solve this would be very helpful.

Upvotes: 1

Views: 902

Answers (2)

curious_me

Reputation: 26

It looks like there is a newline character inside your Japanese string. Can you try the multiLine option when reading the file?

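// filePath is assumed to point at the tab-separated file from the question.
val data = spark.read.format("csv")
  .option("header", "true")
  // The question's file is tab-separated; "\n" is a line break, not a column delimiter.
  .option("delimiter", "\t")
  .option("charset", "utf-16")
  .option("inferSchema", "true")
  // Allow a single record to span several physical lines.
  .option("multiLine", true)
  .load(filePath)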

Note: according to the answers to "How to handle multi line rows in spark?", this approach can cause problems when the input file is very large.
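To check whether a line break is really involved, it may help to look at the raw bytes. The sketch below shows that "」" (U+300D) encoded as UTF-16 contains the byte 0x0D, which is also the carriage-return byte. If the reader splits the file into lines on raw bytes before decoding (an assumption about your Spark version, not something verified here), that would explain the split at exactly this character. The object and method names are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object Utf16Bytes {
  // Space-separated hex dump of the bytes of `s` in charset `cs`.
  def hex(s: String, cs: Charset): String =
    s.getBytes(cs).map(b => f"${b & 0xff}%02X").mkString(" ")

  // True if the encoded bytes contain a raw CR (0x0D) or LF (0x0A) byte.
  def hasLineBreakByte(s: String, cs: Charset): Boolean =
    s.getBytes(cs).exists(b => b == 0x0D || b == 0x0A)

  def main(args: Array[String]): Unit = {
    val bracket = "\u300D" // the character "」", written as an escape
    println(hex(bracket, StandardCharsets.UTF_16BE)) // 30 0D
    println(hex(bracket, StandardCharsets.UTF_16LE)) // 0D 30
    println(hex(bracket, StandardCharsets.UTF_8))    // E3 80 8D
    println(hasLineBreakByte(bracket, StandardCharsets.UTF_16LE))
  }
}
```

In UTF-8 the same character has no 0x0D or 0x0A byte, so a byte-level line splitter would not be misled by it.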

Upvotes: 1

KZapagol

Reputation: 928

The code below should work for UTF-16. I was not able to set the CSV file's encoding to UTF-16 in Notepad++, so I tested it with UTF-8. Please make sure that your input file is actually encoded as UTF-16.

Code snippet :

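import java.io.{BufferedReader, FileInputStream, InputStreamReader}

// Decode the file as UTF-16 while reading it.
val br = new BufferedReader(
  new InputStreamReader(
    new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))

try {
  // readLine() returns null at end of file, so keep reading until then.
  // (Iterating over a single readLine() result would loop over its characters.)
  Iterator.continually(br.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  br.close()
}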
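Building on the same idea of decoding the file explicitly, one possible workaround (a sketch, not something tested against Spark here) is to re-encode the file to UTF-8 first and then point Spark at the UTF-8 copy. The object name, method name and paths are hypothetical, and the input is assumed to be UTF-16 with a byte-order mark (without one, Java decodes it as big-endian).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object ReencodeCsv {
  // Copies `src` (assumed to be UTF-16) to `dst` as UTF-8, one buffer at a time.
  def utf16ToUtf8(src: Path, dst: Path): Unit = {
    val reader = Files.newBufferedReader(src, StandardCharsets.UTF_16)
    val writer = Files.newBufferedWriter(dst, StandardCharsets.UTF_8)
    try {
      val buf = new Array[Char](8192)
      var n = reader.read(buf)
      while (n != -1) {
        writer.write(buf, 0, n)
        n = reader.read(buf)
      }
    } finally {
      reader.close()
      writer.close()
    }
  }
}
```

The copy streams through a fixed-size buffer, so it does not need to hold the whole file in memory. The resulting file can then be read with "charset" set to "UTF-8".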

csvFile content used:

【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00

Update:

If you want to load the file using Spark, you can read the CSV file as below.

spark.read
      .format("com.databricks.spark.csv")
      .option("charset", "UTF-16")
      .option("header", "false")
      .option("escape", "\\")
      .option("delimiter", ",")
      .option("inferSchema", "false")
      .load(fromPath)

Sample Input file for above code:

  "102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","カグラアカガワヤツキヨク","セキュリティ","受講登録でス"

Upvotes: 1
