Reputation: 1080
I have a table in Hive, and I am reading that table in PySpark into df_sprk_df:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Read the Hive table into a Spark DataFrame, then convert it to pandas
df_sprk_df = hive_context.sql('select * from databasename.tablename')
df_pandas_df = df_sprk_df.toPandas()
df_pandas_df = df_pandas_df.astype('str')
But when I try to convert df_pandas_df with astype of str, I get an error like
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position
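(For context, this is Python 2's default 'ascii' codec at work, and it reproduces without Spark at all; a minimal sketch, assuming Python 2:)
value = u'\u20ac'  # a euro sign like the one coming back from Hive
str(value)         # raises UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0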
I even tried to convert the columns to str one by one:
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].str.encode('utf-8')
but no luck. So basically, how can I import a Hive table into a pandas DataFrame with UTF-8 encoding?
Upvotes: 1
Views: 6558
Reputation: 1080
This workaround helped to solve it: change the default encoding for the session
import sys
reload(sys)  # Python 2 only: reload() restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('UTF-8')
and then
df_pandas_df = df_pandas_df.astype(str)
converts the whole DataFrame to strings.
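Note that sys.setdefaultencoding changes the encoding for the entire interpreter session. A more local sketch of the same idea, assuming Python 2 and that only the string (object) columns carry unicode values, would be to encode just those columns:
for col in df_pandas_df.select_dtypes(include=['object']).columns:
    # encode unicode values to UTF-8 bytes, leave everything else untouched
    df_pandas_df[col] = df_pandas_df[col].apply(
        lambda v: v.encode('utf-8') if isinstance(v, unicode) else v)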
Upvotes: 2
Reputation: 6385
Instead of directly casting it to string, try to infer the types of the pandas DataFrame using the following statement:
df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
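(Side note: pd.lib is a private location; later pandas releases expose the same function publicly. A hedged equivalent, assuming pandas >= 0.20:)
import pandas as pd
# same inference through the public API
df_pandas_df.apply(lambda x: pd.api.types.infer_dtype(x.values))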
UPD: try to perform the mapping without the .str invocation.
Maybe something like below:
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
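One caveat with this: in Python 2, unicode(x, errors='ignore') raises a TypeError when x is not a string (an int, for example), so a guarded sketch of the same mapping might look like:
for cols in df_pandas_df.columns:
    # decode only raw strings; pass numeric and other values through as-is
    df_pandas_df[cols] = df_pandas_df[cols].apply(
        lambda x: unicode(x, 'utf-8', errors='ignore') if isinstance(x, str) else x)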
Upvotes: 0