Pranav Kulkarni

Reputation: 21

How to read a UTF-8 encoded file in Spark Scala

I am trying to read a UTF-8 encoded file into Spark Scala. I am doing this:

val nodes = sparkContext.textFile("nodes.csv")

where the given csv file is in UTF-8, but Spark converts the non-English characters to `?`. How do I get it to read the actual values? I tried it in PySpark and it works fine, apparently because PySpark's `textFile()` function has an encoding option and supports UTF-8 by default (it seems).

I am sure the file is in UTF-8 encoding. I did this to confirm:

➜  workspace git:(f/playground) ✗ file -I nodes.csv
nodes.csv: text/plain; charset=utf-8
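This symptom usually means the bytes are being decoded with the wrong charset somewhere along the way, not that the file itself is broken. A minimal stdlib-only sketch of the failure mode (the file name and contents here are made up for illustration): decoding UTF-8 bytes with an ASCII decoder that replaces unmappable input mangles the accented character, while an explicit UTF-8 codec preserves it.

```scala
import java.nio.charset.{Charset, CodingErrorAction}
import java.nio.file.Files
import scala.io.{Codec, Source}

// Write "café" to a temp file as UTF-8 bytes ('é' is two bytes: 0xC3 0xA9)
val path = Files.createTempFile("nodes", ".csv")
Files.write(path, "café".getBytes("UTF-8"))

// Decoding with US-ASCII and REPLACE turns each non-ASCII byte into the
// Unicode replacement character — analogous to the corruption seen when a
// non-UTF-8 default charset is picked up (assumption: that is what happens
// in the asker's Spark setup).
val asciiCodec = Codec(Charset.forName("US-ASCII").newDecoder()
  .onMalformedInput(CodingErrorAction.REPLACE)
  .onUnmappableCharacter(CodingErrorAction.REPLACE))
val mangled = Source.fromFile(path.toFile)(asciiCodec).mkString

// An explicit UTF-8 codec decodes the same bytes correctly
val correct = Source.fromFile(path.toFile)(Codec.UTF8).mkString

println(mangled)  // "caf" followed by two replacement characters
println(correct)  // café
```

Checking the JVM's default charset (`java.nio.charset.Charset.defaultCharset()`) on the Spark driver and executors is one way to confirm whether this is the cause.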

Upvotes: 0

Views: 2885

Answers (1)

joel

Reputation: 7867

Using this post, we can read the file first and then feed it to the sparkContext:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val rdd = sc.parallelize(Source.fromFile(filename)(decoder).getLines().toList)
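A small stdlib-only sketch of what `onMalformedInput(CodingErrorAction.IGNORE)` does (the file and the stray `0xFF` byte are invented for illustration): malformed byte sequences are silently dropped instead of raising a `MalformedInputException`.

```scala
import java.nio.charset.CodingErrorAction
import java.nio.file.Files
import scala.io.{Codec, Source}

// A hypothetical file with a stray non-UTF-8 byte (0xFF) between valid text
val path = Files.createTempFile("bad", ".csv")
Files.write(path, Array[Byte]('a', 'b', 0xFF.toByte, 'c'))

// IGNORE drops the malformed sequence rather than failing the read
val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val cleaned = Source.fromFile(path.toFile)(decoder).mkString

println(cleaned)  // abc
```

Note that this approach reads the whole file on the driver, so it only suits files that fit in memory. On newer Spark versions, the DataFrame CSV reader also accepts an encoding option (e.g. `spark.read.option("encoding", "UTF-8").csv(...)`), which keeps the read distributed.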

Upvotes: 1
