Ashika Umanga Umagiliya

Reputation: 9158

Spark: Japanese letters are garbled in Parquet files created in HDFS

I have a Spark job which reads CSV files on S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.

When I run this job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese letters look fine.

But when I run this on my Spark cluster, reading the same S3 CSV file and writing the Parquet to HDFS, all the Japanese letters are garbled.

Run on the Spark cluster (data is garbled):

spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

Run locally (data looks fine):

spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap=  -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar

As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.

Reading CSV:

def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
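For reference, the reader above is invoked roughly like this; the schema below is only a placeholder for illustration, not the real table schema:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder schema for illustration; the real schema is built from the Teradata table definition
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("name_ja", StringType),
  StructField("time_stamp", StringType)
))

val df = readTeradataCSV(schema, "s3a://bucket/export/e2e-test")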

This is how I write to parquet:

finalDf.write
      .format("parquet")
      .mode(SaveMode.Append)
      .option("path", hdfsTablePath)
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .partitionBy(parCols: _*)
      .save()

This is how the data on HDFS looks (screenshot of garbled Japanese text):

Any tips on how to fix this?

Does the input CSV file have to be in UTF-8 encoding?

** Update ** Found out it's not related to Parquet, but rather to CSV loading. Asked a separate question here:

Spark CSV reader : garbled Japanese text and handling multilines

Upvotes: 0

Views: 1120

Answers (1)

Samson Scharfrichter

Reputation: 9067

The Parquet format has no option for encoding or charset; cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala

Hence these options in your code have no effect:

finalDf.write
      .format("parquet")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
(...)

These options apply only to CSV; you should set them (or rather ONE of them, since they are synonyms) when reading the source file.
That is assuming you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
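As a minimal sketch (reusing schema, hdfsTablePath and parCols from your question, and assuming the source file really is UTF-8), the charset handling would move to the CSV read and disappear from the Parquet write:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("encoding", "UTF-8")   // set the charset here, on the CSV source
  .schema(schema)
  .csv("s3a://bucket/export/e2e-test")

// Parquet stores strings as UTF-8 by design, so no encoding/charset options are needed (or honored) on write
df.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .partitionBy(parCols: _*)
  .save()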

Upvotes: 1
