Ashika Umanga Umagiliya

Reputation: 9158

Spark: Japanese letters are garbled in Parquet files created in HDFS

I have a Spark job which reads CSV files on S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.

When I run this job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese letters look fine.

But when I run this on my Spark cluster, reading the same S3 CSV file and writing the Parquet to HDFS, all the Japanese letters are garbled.

Run on the Spark cluster (data is garbled):

spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

Run locally (data looks fine):

spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.

Reading CSV:

def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
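For reference, the reader above is invoked roughly like this; the schema below is only a placeholder for illustration, not the real table schema:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder schema for illustration; the real schema is built from the Teradata table definition
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("name_ja", StringType),
  StructField("time_stamp", StringType)
))

val df = readTeradataCSV(schema, "s3a://bucket/export/e2e-test")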

This is how I write to parquet:

finalDf.write
      .format("parquet")
      .mode(SaveMode.Append)
      .option("path", hdfsTablePath)
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .partitionBy(parCols: _*)
      .save()

This is how the data on HDFS looks (screenshot of garbled Japanese text):

Any tips on how to fix this?

Does the input CSV file have to be in UTF-8 encoding?

** Update ** Found out it's not related to Parquet, but rather to CSV loading. Asked a separate question here:

Spark CSV reader : garbled Japanese text and handling multilines

Upvotes: 0

Views: 1120

Answers (1)

Samson Scharfrichter

Reputation: 9067

The Parquet format has no option for encoding or charset; cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala

Hence these options in your code have no effect:

finalDf.write
      .format("parquet")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
(...)

These options apply only to CSV; you should set them (or rather ONE of them, since they are synonyms) when reading the source file.
That is assuming you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
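As a minimal sketch (reusing schema, hdfsTablePath and parCols from your question, and assuming the source file really is UTF-8), the charset handling would move to the CSV read and disappear from the Parquet write:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("encoding", "UTF-8")   // set the charset here, on the CSV source
  .schema(schema)
  .csv("s3a://bucket/export/e2e-test")

// Parquet stores strings as UTF-8 by design, so no encoding/charset options are needed (or honored) on write
df.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .partitionBy(parCols: _*)
  .save()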

Upvotes: 1
