Reputation: 1071
I am writing 2 DataFrames from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 minutes to write the first df (1 row) and around 30 minutes to write 1M rows of the second df. I don't know how long it would take to write the entire 20M rows; I killed the job before it could complete.
I have tried two approaches to write the df. I also cached the df to see if it would make the write faster, but it didn't seem to have any effect:
1st Method
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
#df_log.registerTempTable("temp2")
df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried using both registerTempTable() and createOrReplaceTempView(), but there was no difference in the run time.
Is there a way to write it faster or more efficiently? Thanks.
Upvotes: 2
Views: 7785
Reputation: 26
Try repartitioning to a smaller number of partitions, say .repartition(2000), and then write to Hive. A large number of partitions in Spark sometimes makes the write slow.
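For illustration, a minimal sketch of that idea, reusing df_log and the table name from the question (the value 2000 is just this answer's example; tune it to your data and cluster):
# Reduce the number of output partitions before writing to Hive
df_log.repartition(2000) \
    .write.mode("append") \
    .insertInto("project_alpha.sends_log_test")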
Upvotes: 0
Reputation: 4010
It's a bad idea (or design) to insert directly into a Hive table. You should save the data as files and create a table on top of them, or add them as a partition to an existing table.
Can you please try that route?
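A rough sketch of that route, assuming the target is an external Parquet table partitioned by a hypothetical load_date column; the HDFS path and partition value below are illustrative placeholders, not from the question:
# Write the DataFrame out as Parquet files under a partition-style path (hypothetical path)
df_log.write.mode("overwrite").parquet("/data/project_alpha/sends_log_test/load_date=2018-01-01")

# Register those files as a new partition of the existing external table
sqlContext.sql("""
    ALTER TABLE project_alpha.sends_log_test
    ADD IF NOT EXISTS PARTITION (load_date='2018-01-01')
    LOCATION '/data/project_alpha/sends_log_test/load_date=2018-01-01'
""")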
Upvotes: 1
Reputation: 108
Are you sure the final tables are cached? The issue might be that Spark computes the whole pipeline before writing the data. You can check that in the terminal/console where Spark runs.
Also, please check that the table you append to in Hive is not a temporary view; in that case the issue could be that the view is recalculated before the new rows are appended. When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')
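For completeness, a small sketch of combining that append with an explicit materialization step so the pipeline is not recomputed at write time (df and 'schema.table' are placeholders for your own DataFrame and table):
# Materialize the DataFrame once so the write doesn't re-run the whole pipeline
df = df.cache()
df.count()

# Append the materialized result to the Hive table
df.write.saveAsTable('schema.table', mode='append')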
Upvotes: 3