I am trying to create a DataFrame in PySpark, write it as a Hive table, and then read it back, but the value I read back is not the one I wrote:
from pyspark.sql import HiveContext, Row
sqlContext = HiveContext(sc)
hive_context = HiveContext(sc) #Initialize Hive
#load the control table
cntl_dt = [('2016-04-30')]
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = sqlContext.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")
load_dt = hive_context.sql("select load_dt from schema.cntrl_tbl").first()['load_dt']
print(load_dt)
Prints: 2
I expect: 2016-04-30
Reputation: 35249
This is because:
cntl_dt = [('2016-04-30')]
is not a single-element tuple. The parentheses only group the expression and are ignored, so the result is the same as:
['2016-04-30']
and
Row(load_dt=x[0])
indexes into the string itself, so x[0] is its first character and you get:
Row(load_dt='2')
Use a trailing comma to make it a tuple:
cntl_dt = [('2016-04-30', )]
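To see the difference without Spark, here is a small plain-Python check. The variable names are only for illustration and are not part of the original code:

```python
# Parentheses alone only group an expression; the trailing comma makes the tuple.
not_a_tuple = ('2016-04-30')
one_tuple = ('2016-04-30',)

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# Indexing with [0], as the lambda in the question does:
print(not_a_tuple[0])  # 2 (first character of the string)
print(one_tuple[0])    # 2016-04-30 (first element of the tuple)
```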
Also, you're mixing different contexts (SQLContext and HiveContext), which is generally a bad idea. Neither should be used in recent Spark versions, where SparkSession with Hive support enabled replaces both.