Reputation: 9158
I have a Spark job which reads CSV files from S3, processes them and saves the result as Parquet files. These CSVs contain Japanese text.
When I run this job locally, reading the S3 CSV file and writing the Parquet files to a local folder, the Japanese characters look fine.
But when I run it on my Spark cluster, reading the same S3 CSV file and writing the Parquet output to HDFS, all the Japanese characters are garbled.
run on spark-cluster (data is garbled)
spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
run locally (data looks fine)
spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.
Reading CSV:
def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")   // fields are separated by the ^A (SOH) character
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
This is how I write to parquet:
finalDf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .option("encoding", "UTF-8")
  .option("charset", "UTF-8")
  .partitionBy(parCols: _*)
  .save()
This is what the data on HDFS looks like:
Any tips on how to fix this?
Does the input CSV file have to be in UTF-8 encoding?
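For instance, if the file turned out to be in some other charset, would pointing the reader at that charset be the right fix? A hypothetical variant of the reader above (Shift_JIS is just an example value, not something I have confirmed about the file):
dataFrameReader.option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("encoding", "Shift_JIS")   // example only; would have to match the file's real charset
  .schema(schema)
  .csv(path)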
** Update ** Found out it's not related to Parquet, but rather to CSV loading. Asked a separate question here:
Spark CSV reader : garbled Japanese text and handling multilines
Upvotes: 0
Views: 1120
Reputation: 9067
The Parquet format has no option for encoding or charset, cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
Hence this part of your code has no effect:
finalDf.write
.format("parquet")
.option("encoding", "UTF-8")
.option("charset", "UTF-8")
(...)
These options apply only to CSV; you should set them (or rather ONE of them, since they are synonyms) when reading the source file.
This assumes you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
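For example, a minimal sketch of where the option belongs, reusing the names from your own snippets (schema, path, parCols, hdfsTablePath):
import org.apache.spark.sql.SaveMode

// The charset matters on the CSV read...
val df = spark.read
  .option("delimiter", "\u0001")
  .option("header", "false")
  .option("multiLine", "true")
  .option("encoding", "UTF-8")   // or "charset" -- pick one, they are synonyms for the CSV source
  .schema(schema)
  .csv(path)

// ...while the Parquet writer ignores encoding/charset, so drop them here.
df.write
  .format("parquet")
  .mode(SaveMode.Append)
  .partitionBy(parCols: _*)
  .save(hdfsTablePath)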
Upvotes: 1