Victor

Reputation: 17107

Cannot create a dataframe in pyspark and write it to Hive table

I am trying to create a dataframe in pyspark, then write it as a Hive table, and then read it back, but it is not working...

sqlContext = HiveContext(sc)

hive_context = HiveContext(sc) #Initialize Hive

#load the control table 
cntl_dt = [('2016-04-30')]
rdd = sc.parallelize(cntl_dt)
row_cntl_dt = rdd.map(lambda x: Row(load_dt=x[0]))
df_cntl_dt = sqlContext.createDataFrame(row_cntl_dt)
df_cntl_dt.write.mode("overwrite").saveAsTable("schema.cntrl_tbl")
load_dt  = hive_context.sql("select load_dt  from schema.cntrl_tbl" ).first()['load_dt'];
print (load_dt)

Prints: 2

I expect: 2016-04-30

Upvotes: 0

Views: 538

Answers (1)

Alper t. Turker

Reputation: 35249

This is because:

cntl_dt = [('2016-04-30')]

is not valid syntax for a single-element tuple. The parentheses are ignored (a tuple needs a trailing comma), so the result is the same as:

['2016-04-30']

and

Row(load_dt=x[0])

will give:

Row(load_dt='2')
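because indexing a string returns its first character. The effect can be reproduced in plain Python, with no Spark involved:

```python
# Parentheses without a trailing comma do not create a tuple.
s = ('2016-04-30')       # still a plain str
t = ('2016-04-30', )     # a one-element tuple
print(type(s).__name__)  # str
print(s[0])              # '2'  - indexing a str returns its first character
print(t[0])              # '2016-04-30'
```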

Use:

cntl_dt = [('2016-04-30', )]
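With the trailing comma, each list element is a genuine tuple, so `x[0]` in the `map` yields the whole date string. A plain-Python sketch of what the map step now produces (the list comprehension stands in for `rdd.map`):

```python
# Each element is now a one-element tuple, so x[0] is the full date string.
cntl_dt = [('2016-04-30', )]
mapped = [x[0] for x in cntl_dt]  # mimics rdd.map(lambda x: x[0])
print(mapped)  # ['2016-04-30']
```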

Also, you're mixing different contexts (SQLContext and HiveContext), which is generally a bad idea. In any recent Spark version neither should be used directly; SparkSession replaces both.

Upvotes: 1
