Manish Visave

Reputation: 166

PySpark Encoding

I'm working with data that an Alteryx workflow saves to S3; we then use PySpark to write that data into Redshift.

Now the problem is that the file is encoded as ANSI/OEM Japanese Shift-JIS, as shown below.

[screenshot: file encoding shown as ANSI/OEM Japanese Shift-JIS]

When reading it in PySpark with the shift-jis encoding, it still fails to parse the data properly. For example:

[screenshots: one sample value is decoded to mojibake, while another sample value is decoded correctly]

So my question is: is ANSI/OEM Shift-JIS different from the shift-jis encoding used in PySpark, and if so, what can be used to resolve the issue?

Upvotes: 0

Views: 43

Answers (1)

Vaibhav K

Reputation: 3076

You can use encoding='cp932' when reading the file.

The key difference might be that PySpark is decoding with plain Shift-JIS, while your data is encoded in a Windows-specific variant, cp932 (what Windows labels ANSI/OEM Shift-JIS). cp932 is a superset of Shift-JIS that adds NEC special characters and IBM extensions, so some byte sequences it produces have no mapping in strict Shift-JIS. Try using cp932 as the encoding in PySpark and see if it resolves the issue.
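To make the difference concrete, here is a small stand-alone Python check (independent of Spark): the byte pair 0x87 0x40 falls in the NEC extension rows that cp932 adds, so it decodes under cp932 but not under strict shift_jis.

raw = b"\x87\x40"

# cp932 maps 0x8740 to the NEC special character '①'
print(raw.decode("cp932"))

# strict Shift-JIS has no assignment for 0x8740, so this raises
try:
    raw.decode("shift_jis")
except UnicodeDecodeError as e:
    print(e)

If bytes like these appear in your file, reading it as shift-jis will fail or mangle those characters even though cp932 handles them cleanly.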

Try this:

import pandas as pd

# pandas needs the s3fs package installed to read s3:// URLs directly
df_pd = pd.read_csv('s3://path_to_file.csv', encoding='cp932')

or

# Spark's CSV reader takes the charset through the 'encoding' option
df = spark.read.csv('s3://path_to_file.csv', encoding='cp932', header=True)
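Note that Spark resolves the encoding option through the JVM's charset registry rather than Python's codecs, so the accepted names come from Java. If 'cp932' happens not to be recognized in your environment, 'MS932' (or 'windows-31j') is Java's name for the same Windows code page:

df = spark.read.csv('s3://path_to_file.csv', encoding='MS932', header=True)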

Upvotes: 0
