Reputation: 8967

How to insert Spark DataFrame to Hive Internal table?

What's the right way to insert a DataFrame into a Hive internal table in append mode? It seems we can either write the DataFrame directly to Hive using the saveAsTable method, or register the DataFrame as a temp table and then use a query.

df.write().mode("append").saveAsTable("tableName")

df.registerTempTable("temptable") 
sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable")

Will the second approach append the records or overwrite them?

Is there any other effective way to write the DataFrame to a Hive internal table?

Upvotes: 10

Answers (3)

uh_big_mike_boi

Reputation: 3470

You could also overwrite only the partitions you are inserting into, and you can do that with dynamic partitioning.

spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

temp_table = "tmp_{}".format(table)
df.createOrReplaceTempView(temp_table)
spark.sql("""
    insert overwrite table `{schema}`.`{table}`
    partition (partCol1, partCol2)
      select col1       
           , col2       
           , col3       
           , col4   
           , partCol1
           , partCol2
    from {temp_table}
""".format(schema=schema, table=table, temp_table=temp_table))
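The statement above hard-codes the column list. As a rough sketch, the same statement could be assembled from lists of column names. The helper name `build_insert_overwrite`, its parameters, and the schema and table names are hypothetical and not part of Spark. It only builds a string, which you would then pass to spark.sql(). It relies on the Hive rule that dynamic partition columns come last in the SELECT list, in the same order as in the PARTITION clause.

```python
def build_insert_overwrite(schema, table, temp_table, data_cols, part_cols):
    """Build a dynamic-partition INSERT OVERWRITE statement as a string.

    Hypothetical helper for illustration; pass the result to spark.sql().
    """
    # With dynamic partitioning, the partition columns must come last in
    # the SELECT list, in the same order as in the PARTITION clause.
    select_list = ", ".join(list(data_cols) + list(part_cols))
    partition_spec = ", ".join(part_cols)
    return (
        "insert overwrite table `{schema}`.`{table}` "
        "partition ({parts}) "
        "select {cols} from {temp}"
    ).format(schema=schema, table=table, parts=partition_spec,
             cols=select_list, temp=temp_table)


stmt = build_insert_overwrite(
    schema="my_schema",          # assumed schema name
    table="my_table",            # assumed table name
    temp_table="tmp_my_table",   # assumed temp view name
    data_cols=["col1", "col2", "col3", "col4"],
    part_cols=["partCol1", "partCol2"],
)
print(stmt)
```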

Upvotes: 0

Sandeep Singh

Reputation: 8010

df.saveAsTable("tableName", "append") is deprecated. Instead, you should use the second approach.

sqlContext.sql("CREATE TABLE IF NOT EXISTS mytable as select * from temptable")

It will create the table if the table does not exist. When you run your code a second time, you need to drop the existing table; otherwise your code will exit with an exception.

Another approach, if you don't want to drop the table: create the table separately, then insert your data into it.

The code below appends data to an existing table:

sqlContext.sql("insert into table mytable select * from temptable")

And the code below overwrites the data in an existing table:

sqlContext.sql("insert overwrite table mytable select * from temptable")

This answer is based on Spark 1.6.2. If you are using another version of Spark, I would suggest checking the appropriate documentation.
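Putting the temp table and the two insert statements together, a minimal sketch in Spark 1.6 style might look like this. The function name `insert_via_temp_table` and the default temp table name are hypothetical. It assumes the first argument is a HiveContext and that the target table has already been created.

```python
def insert_via_temp_table(sql_context, df, target_table, overwrite=False,
                          temp_table="temptable"):
    """Insert df into an existing Hive table through a temp table.

    Hypothetical helper; assumes sql_context is a HiveContext (Spark 1.6)
    and that target_table was created beforehand.
    """
    # Expose the DataFrame to SQL under a temporary name.
    df.registerTempTable(temp_table)
    # "insert into" appends; "insert overwrite" replaces the existing data.
    verb = "insert overwrite" if overwrite else "insert into"
    statement = "{verb} table {target} select * from {temp}".format(
        verb=verb, target=target_table, temp=temp_table)
    sql_context.sql(statement)
    return statement
```

Calling insert_via_temp_table(sqlContext, df, "mytable") would append, and passing overwrite=True would replace the table's contents.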

Upvotes: 9

Hansang

Reputation: 1622

Neither of the options here worked for me; they have probably been deprecated since the answers were written.

According to the latest Spark API docs (for Spark 2.1), the way to do this is the insertInto() method of the DataFrameWriter class.

I'm using the Python PySpark API, but the Scala API is similar:

df.write.insertInto("target_db.target_table", overwrite=False)

The above worked for me.
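One thing to watch for: insertInto() matches columns by position rather than by name, so the DataFrame's column order has to agree with the target table's. A minimal sketch of guarding against that follows. The helper name `append_by_position` is hypothetical. It assumes `spark` is a SparkSession with Hive support (Spark 2.x), that the target table already exists, and that the DataFrame has a column for each of the table's columns.

```python
def append_by_position(spark, df, target_table):
    """Append df to an existing Hive table with insertInto().

    Hypothetical helper; assumes spark is a SparkSession with Hive support
    and that df has a column for every column of target_table.
    """
    # insertInto() resolves columns by position, so reorder the DataFrame
    # to the target table's column order first.
    target_columns = spark.table(target_table).columns
    df.select(*target_columns).write.insertInto(target_table, overwrite=False)
```

For example, append_by_position(spark, df, "target_db.target_table") would append the rows after reordering the columns.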

Upvotes: 21
