Phakin

Reputation: 123

Spark UTF-8 error, non-English data becomes `??????????`

One of the fields in our data is in a non-English language (Thai). We can load the data into HDFS, and the non-English field displays correctly when we run:

hadoop fs -cat /datafile.txt

However, when we use Spark to load and display the data, all of the non-English data is shown as ??????????????

We have added the following when we run Spark:

System.setProperty("file.encoding", "UTF-8")

Has anyone else seen this? What do I need to do to use non-English data in Spark?

We are running Spark 1.3.0, Scala 2.10.4 on Ubuntu 14.04.

The command we run to test is:

val textFile = sc.textFile(inputFileName)
textFile.take(10).foreach(println)
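The `?` characters are the classic symptom of text being pushed through a charset that cannot represent it, which happens when the JVM's default charset (driven by the locale) is not UTF-8. A minimal sketch reproducing the symptom outside Spark (the Thai string here is just an example):

```scala
// When text is encoded with a charset that cannot represent its
// characters, each unmappable character becomes a literal '?' byte --
// the same ?????? output seen in the Spark shell.
val thai = "สวัสดี"  // example Thai string ("hello")
val mangled = new String(thai.getBytes("US-ASCII"), "US-ASCII")
println(mangled)     // ??????
```

This is why setting `file.encoding` after the JVM has started has no effect: the default charset is fixed at JVM startup from the environment's locale.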

Upvotes: 1

Views: 1999

Answers (1)

Phakin

Reputation: 123

We are running Spark on Docker, and the problem turned out to be that the locale was not set inside the container.

To set the locale in a Docker container, run `update-locale` and then `source /etc/default/locale`. Restarting the container will not do this for you.
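In our setup that amounted to something like the following, run inside the container before launching Spark (`en_US.UTF-8` is just an example locale name):

```shell
# Generate and persist a UTF-8 locale, then load it into the current shell.
locale-gen en_US.UTF-8            # make sure the locale exists on the system
update-locale LANG=en_US.UTF-8    # writes the choice to /etc/default/locale
source /etc/default/locale        # reload it into this shell session
export LANG                       # so child processes (e.g. Spark) inherit it
```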

Thanks @lmm for the inspiration.

Upvotes: 1
