Shivpe_R

Reputation: 1080

pyspark HiveContext -- read table with UTF-8 encoding

I have a table in Hive, and I am reading that table into PySpark as df_sprk_df:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Read the Hive table into a Spark DataFrame, then convert to pandas
df_sprk_df = hive_context.sql('select * from databasename.tablename')
df_pandas_df = df_sprk_df.toPandas()
df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df to str with astype, I get an error like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position

I even tried converting the columns to str one by one:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].str.encode('utf-8')

but no luck. So basically, how can I import a Hive table into a DataFrame with UTF-8 encoding?

Upvotes: 1

Views: 6558

Answers (2)

Shivpe_R

Reputation: 1080

This workaround solved it: change the default encoding for the session

import sys
reload(sys)  # Python 2 only: reload re-exposes setdefaultencoding
sys.setdefaultencoding('UTF-8')

and then

df_pandas_df = df_pandas_df.astype(str)

converts the whole DataFrame to strings.
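Note that reload(sys)/setdefaultencoding only exists on Python 2, and changing the interpreter-wide default encoding affects the whole process. A narrower alternative, sketched below under the assumption that the offending cells are Python 2 unicode values, is to encode each cell explicitly instead:

# Minimal Python 2 sketch: encode unicode cells to UTF-8 byte strings
# cell by cell, without touching the process-wide default encoding.
df_pandas_df = df_pandas_df.applymap(
    lambda v: v.encode('utf-8') if isinstance(v, unicode) else str(v)
)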

Upvotes: 2

vvg

Reputation: 6385

Instead of casting it directly to string, try to infer the types of the pandas DataFrame using the following statement:

df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
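(On newer pandas releases pd.lib has been removed; the same helper lives under pandas.api.types. A self-contained sketch, where demo_df is a made-up DataFrame standing in for df_pandas_df:)

import pandas as pd

# demo_df is a hypothetical stand-in for df_pandas_df, to make this runnable
demo_df = pd.DataFrame({'price': [u'\u20ac10', u'\u20ac20'], 'qty': [1, 2]})

# pandas >= 0.20 spelling of the same dtype inference
print(demo_df.apply(lambda x: pd.api.types.infer_dtype(x.values)))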

UPD: try to perform the mapping without the .str accessor.

Maybe something like below:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
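Note that unicode(x, errors='ignore') assumes each x is a byte string; it raises a TypeError when a cell is already unicode or is a non-string type. A slightly more defensive variant (Python 2; to_unicode is a hypothetical helper, not part of pandas):

def to_unicode(x):
    # pass unicode through, decode byte strings, coerce everything else
    if isinstance(x, unicode):
        return x
    if isinstance(x, str):
        return x.decode('utf-8', 'ignore')
    return unicode(x)

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(to_unicode)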

Upvotes: 0
