codeBarer

Reputation: 2388

How can you perform an insert overwrite using Spark?

I'm trying to transition one of our Hive ETL scripts to Spark. The Hive ETL maintains a table from which part of the data needs to be deleted every night before the new sync. The script deletes the data that is older than 3 days using insert overwrite: essentially it creates a temp table holding only the data that is not older than three days, and then overwrites the main table with it.

With Spark (using Scala) I keep getting an error saying that I cannot write to the same source I'm reading from. Here's my code:

spark.sql ("Select * from mytbl_hive where dt > date_sub(current_date, 3)").registerTempTable("tmp_mytbl")

val mytbl = sqlContext.table("tmp_mytbl")
mytbl.write.mode("overwrite").saveTableAs("tmp_mytbl")

//writing back to Hive ...

mytbl.write.mode("overwrite").insertInto("mytbl_hive")

I get the error that I cannot write to the table I'm reading from.

Does anyone know of a better way of doing this?

Upvotes: 3

Views: 8268

Answers (1)

user8705929

Reputation: 76

You cannot. As you've learned, Spark explicitly prohibits overwriting a table that is used as a source for the query. While some workarounds exist depending on the technicalities, they are not reliable and should be avoided.

Instead:

  • Write the data you want to keep to a temporary table.
  • Drop the old table.
  • Rename the temporary table to the original name.
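A minimal sketch of that sequence in Scala, assuming the table name from the question (mytbl_hive) and a hypothetical staging table name mytbl_hive_staging:

// Keep only the rows from the last three days (the retention rule from the question)
val keep = spark.sql("SELECT * FROM mytbl_hive WHERE dt > date_sub(current_date, 3)")

// 1. Write the rows to keep into a separate physical table
keep.write.mode("overwrite").saveAsTable("mytbl_hive_staging")

// 2. Drop the old table (saveAsTable above has already materialized the data)
spark.sql("DROP TABLE IF EXISTS mytbl_hive")

// 3. Rename the staging table so it takes the original table's place
spark.sql("ALTER TABLE mytbl_hive_staging RENAME TO mytbl_hive")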

The script deletes the data that is older than 3 days using insert overwrite.

It might be a better idea to partition the data by date and just drop partitions, without even looking at the data.
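For example, a sketch assuming mytbl_hive is rebuilt as a table partitioned by dt (a yyyy-MM-dd string) and a hypothetical DataFrame newData holding the new day's rows; the date literal below is only illustrative:

// Append the new data into the partitioned table
newData.write
  .mode("append")
  .partitionBy("dt")      // must match the table's existing partitioning
  .saveAsTable("mytbl_hive")

// Expiring old data then just drops whole partitions instead of rewriting the table
spark.sql("ALTER TABLE mytbl_hive DROP IF EXISTS PARTITION (dt = '2017-10-01')")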

Upvotes: 6
