Shivpe_R

Reputation: 1080

pyspark HiveContext -- read table with UTF-8 encoding

I have a table in Hive, and I am reading that table into PySpark as df_sprk_df:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Read the Hive table into a Spark DataFrame, then convert to pandas
df_sprk_df = hive_context.sql('select * from databasename.tablename')
df_pandas_df = df_sprk_df.toPandas()
df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df to str with astype, I get an error like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position

I even tried converting the columns to str one by one:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].str.encode('utf-8')

but no luck. So basically, how can I import a Hive table into a DataFrame with UTF-8 encoding?

Upvotes: 1

Views: 6558

Answers (2)

Shivpe_R

Reputation: 1080

This workaround solved it: change the default encoding for the session

import sys
reload(sys)  # Python 2 only: reload re-exposes setdefaultencoding
sys.setdefaultencoding('UTF-8')

and then

df_pandas_df = df_pandas_df.astype(str)

converts the whole DataFrame to strings.
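Note that reload(sys)/setdefaultencoding only exists on Python 2, and changing the interpreter-wide default encoding affects the whole process. A narrower alternative, sketched below under the assumption that the offending cells are Python 2 unicode values, is to encode each cell explicitly instead:

# Minimal Python 2 sketch: encode unicode cells to UTF-8 byte strings
# cell by cell, without touching the process-wide default encoding.
df_pandas_df = df_pandas_df.applymap(
    lambda v: v.encode('utf-8') if isinstance(v, unicode) else str(v)
)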

Upvotes: 2

vvg

Reputation: 6385

Instead of casting it directly to string, try to infer the types of the pandas DataFrame using the following statement:

df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
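(On newer pandas releases pd.lib has been removed; the same helper lives under pandas.api.types. A self-contained sketch, where demo_df is a made-up DataFrame standing in for df_pandas_df:)

import pandas as pd

# demo_df is a hypothetical stand-in for df_pandas_df, to make this runnable
demo_df = pd.DataFrame({'price': [u'\u20ac10', u'\u20ac20'], 'qty': [1, 2]})

# pandas >= 0.20 spelling of the same dtype inference
print(demo_df.apply(lambda x: pd.api.types.infer_dtype(x.values)))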

UPD: try to perform the mapping without the .str accessor.

Maybe something like below:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
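Note that unicode(x, errors='ignore') assumes each x is a byte string; it raises a TypeError when a cell is already unicode or is a non-string type. A slightly more defensive variant (Python 2; to_unicode is a hypothetical helper, not part of pandas):

def to_unicode(x):
    # pass unicode through, decode byte strings, coerce everything else
    if isinstance(x, unicode):
        return x
    if isinstance(x, str):
        return x.decode('utf-8', 'ignore')
    return unicode(x)

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(to_unicode)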

Upvotes: 0
