Reputation: 23
I am trying to load a CSV file that contains Japanese characters into a DataFrame in Scala. One column value is "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!",
which is supposed to go into a single column, but the reader breaks the string at "」"
(it treats this character as a new line) and creates two records.
I have set the "charset" option to UTF-16, and the quote character is "\"", yet the DataFrame still shows more records than the file contains.
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("charset", "UTF-16")
  .option("inferSchema", "true")
  .csv("file.txt")
Any pointer on how to solve this would be very helpful.
Upvotes: 1
Views: 902
Reputation: 26
Looks like there's a new line character in your Japanese string. Can you try using the multiLine option while reading the file?
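// multiLine lets a single record span several physical lines.
val data = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t") // tab, to match the tab-separated file in the question
  .option("charset", "utf-16")
  .option("inferSchema", "true")
  .option("multiLine", true)
  .load(filePath)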
Note: According to the answer to "How to handle multi line rows in spark?", there are some concerns with this approach when the input file is very big.
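A possible explanation for the split at "」", offered as an assumption rather than something confirmed in this thread: "」" is U+300D, and in UTF-16 one of its two bytes is 0x0D, the same value as a carriage return. If the reader splits the file into lines on raw bytes before decoding (assumed here for the default, non-multiLine mode), that byte would look like a line break. The sketch below only checks the byte values; the object and method names are hypothetical.

```scala
import java.nio.charset.StandardCharsets

object Utf16LineBreakCheck {
  // Hex values of the UTF-16 (big-endian, no BOM) bytes of a string.
  def utf16Hex(s: String): Seq[String] =
    s.getBytes(StandardCharsets.UTF_16BE).toSeq.map(b => f"${b & 0xff}%02X")

  // True if any encoded byte equals CR (0x0D) or LF (0x0A).
  def containsLineBreakByte(s: String): Boolean =
    s.getBytes(StandardCharsets.UTF_16BE).exists(b => b == 0x0D || b == 0x0A)

  def main(args: Array[String]): Unit = {
    println(utf16Hex("」").mkString(" "))  // prints "30 0D"
    println(containsLineBreakByte("」"))   // prints "true"
    println(containsLineBreakByte("「"))   // prints "false" (bytes are 30 0C)
  }
}
```

If that assumption holds, it would also explain why the string breaks at "」" but not at "「".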
Upvotes: 1
Reputation: 928
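The code below should work for UTF-16. I was not able to set the CSV file encoding to UTF-16 in Notepad++, so I tested it with UTF-8. Please make sure that your input file is actually encoded as UTF-16.
Code snippet:
import java.io.{BufferedReader, FileInputStream, InputStreamReader}

// Decode the file as UTF-16 while reading it line by line.
val br = new BufferedReader(
  new InputStreamReader(
    new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))
try {
  // readLine() returns null at end of file.
  Iterator.continually(br.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  br.close()
}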
csvFile content used:
【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00
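Since the snippet above was only tested with UTF-8, a small round-trip check can help confirm the UTF-16 behaviour. The following is a minimal sketch, assuming a throwaway temporary file and hypothetical names (`Utf16RoundTrip`, `roundTrip`): it writes the sample row in UTF-16 and reads it back through a decoding reader.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object Utf16RoundTrip {
  // Sample row copied from the csvFile content above.
  val sampleRow: String =
    "【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00"

  // Writes `rows` to a temporary UTF-16 file and reads them back line by line.
  def roundTrip(rows: Seq[String]): List[String] = {
    val path = Files.createTempFile("utf16-sample", ".csv") // hypothetical temp file
    try {
      Files.write(path, rows.mkString("\n").getBytes(StandardCharsets.UTF_16))
      val reader = Files.newBufferedReader(path, StandardCharsets.UTF_16)
      try Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
      finally reader.close()
    } finally Files.deleteIfExists(path)
  }

  def main(args: Array[String]): Unit = {
    val lines = roundTrip(Seq(sampleRow))
    println(s"lines read: ${lines.size}")
    println(s"unchanged after round trip: ${lines == List(sampleRow)}")
  }
}
```

Because the reader decodes the bytes into characters before looking for line terminators, the row should come back as a single line with "」" intact.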
Update:
If you want to load the file with Spark instead, you can read the CSV file as shown below.
spark.read
.format("com.databricks.spark.csv")
.option("charset", "UTF-16")
.option("header", "false")
.option("escape", "\\")
.option("delimiter", ",")
.option("inferSchema", "false")
.load(fromPath)
Sample Input file for above code:
"102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","カグラアカガワヤツキヨク","セキュリティ","受講登録でス"
Upvotes: 1