Reputation: 123
One of the fields in our data is in a non-English language (Thai). We can load the data into HDFS and the system displays the non-English field correctly when we run:
hadoop fs -cat /datafile.txt
However, when we use Spark to load and display the data, all the non-English data displays as ??????????????.
We have added the following when we run Spark:
System.setProperty("file.encoding", "UTF-8")
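(Note: file.encoding is read once at JVM startup, so setting it with System.setProperty from inside a running job generally has no effect on the default charset. If you want to test it, it would need to be passed on the JVM command line instead — the flags below are real Spark 1.x options, but the exact invocation is only a sketch:)

spark-shell --driver-java-options "-Dfile.encoding=UTF-8" --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8"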
Has anyone else seen this? What do I need to do to use non-English data in Spark?
We are running Spark 1.3.0, Scala 2.10.4 on Ubuntu 14.04.
The command that we run to test is:
val textFile = sc.textFile(inputFileName)
textFile.take(10).foreach(println)
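(As a quick diagnostic of whether the JVM itself is the problem, this prints the default charset the driver actually picked up — a minimal check, run in the same shell:)

import java.nio.charset.Charset
println(Charset.defaultCharset())              // ANSI_X3.4-1968 / US-ASCII here means the locale is wrong
println(System.getProperty("file.encoding"))   // the encoding the JVM was started with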
Upvotes: 1
Views: 1999
Reputation: 123
We are running Spark on Docker, and the problem had to do with setting the locale. To set the locale on Docker, you need to run update-locale and then source /etc/default/locale. Restarting Docker will not do this for you.
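(For reference, a minimal sketch of the commands run inside the container; this assumes a Debian/Ubuntu-based image with the locales package installed, and en_US.UTF-8 is just an example — any UTF-8 locale should work:)

locale-gen en_US.UTF-8            # generate the locale if the image does not already have it
update-locale LANG=en_US.UTF-8    # writes LANG to /etc/default/locale
source /etc/default/locale        # pick up the new value in the current shell
export LANG                       # may be needed if LANG was not already exported

After this, a spark-shell started from that shell inherits the UTF-8 locale, so the JVM's default charset is UTF-8 and println no longer mangles the Thai text.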
Thanks @lmm for the inspiration.
Upvotes: 1