Reputation: 166
I'm working with data saved to S3 by an Alteryx workflow; we then use PySpark to write that data into Redshift.
The problem is that the file is encoded as ANSI/OEM Japanese Shift-JIS.
When I read it in PySpark with the shift-jis encoding, it still fails to parse the data properly.
So my question is: is ANSI/OEM Shift-JIS different from the shift-jis used in PySpark, and if so, what can be used to resolve the issue?
Upvotes: 0
Views: 43
Reputation: 3076
You can use encoding='cp932' when reading the file.
The key difference is that PySpark is decoding with plain Shift-JIS, but your data appears to be encoded in the Windows-specific variant cp932 (what Windows calls ANSI/OEM Japanese Shift-JIS). cp932 extends Shift-JIS with the NEC and IBM extension characters, so bytes in those ranges fail to decode under strict shift_jis. Try cp932 as the encoding in PySpark and see if it resolves the issue.
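As a quick sanity check outside Spark (a minimal sketch; the sample characters are just illustrative NEC-extension characters), plain Python shows the two codecs disagree:

# ① and ㈱ live in cp932's NEC extensions but not in strict shift_jis,
# so the same bytes decode under one codec and fail under the other.
data = "①㈱".encode("cp932")
print(data.decode("cp932"))        # works: ①㈱
try:
    data.decode("shift_jis")
except UnicodeDecodeError as exc:
    print("shift_jis failed:", exc)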
try this:
import pandas as pd

# pandas route (needs s3fs installed for s3:// paths)
df_pd = pd.read_csv('s3://path_to_file.csv', encoding='cp932')
or
# Spark's CSV reader takes the codec via the 'encoding' (alias 'charset') option
df = spark.read.csv('s3://path_to_file.csv', encoding='cp932', header=True)
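If the output still looks garbled, it's worth confirming the file's actual code page before blaming Spark. A minimal sketch using boto3 (the bucket and key names are placeholders):

import boto3

# Pull the raw bytes and try both codecs; whichever decodes cleanly
# is the code page the file was actually written with.
body = boto3.client('s3').get_object(Bucket='my-bucket', Key='path_to_file.csv')['Body'].read()
for codec in ('cp932', 'shift_jis'):
    try:
        body.decode(codec)
        print(codec, 'decodes cleanly')
    except UnicodeDecodeError as exc:
        print(codec, 'fails:', exc)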
Upvotes: 0