Reputation: 1071
I am writing 2 DataFrames from Spark directly to Hive using PySpark. The first df has only one row and 7 columns. The second df has 20M rows and 20 columns. It took 10 minutes to write the first df (1 row) and around 30 minutes to write 1M rows of the second df. I don't know how long it would take to write the entire 20M rows; I killed the job before it could complete.
I have tried two approaches to write the df. I also cached the df to see if it would make the write faster, but it didn't seem to have any effect:
1st Method
df_log.write.mode("append").insertInto("project_alpha.sends_log_test")
2nd Method
#df_log.registerTempTable("temp2")
df_log.createOrReplaceTempView("temp2")
sqlContext.sql("insert into table project_alpha.sends_log_test select * from temp2")
In the 2nd approach I tried using both registerTempTable() and createOrReplaceTempView(), but there was no difference in the run time.
Is there a way to write it faster or more efficiently? Thanks.
Upvotes: 2
Views: 7785
Reputation: 26
Try repartitioning to a smaller number of partitions, say .repartition(2000), and then write to Hive. A large number of partitions in Spark sometimes makes the write slow.
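For illustration, a minimal sketch of that idea, reusing df_log and the table name from the question (the value 2000 is just this answer's example; tune it to your data and cluster):
# Reduce the number of output partitions before writing to Hive
df_log.repartition(2000) \
    .write.mode("append") \
    .insertInto("project_alpha.sends_log_test")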
Upvotes: 0
Reputation: 4010
It's a bad idea (or design) to insert directly into a Hive table. You should save the data as files and create a table on top of them, or add them as a partition to an existing table.
Can you please try that route?
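A rough sketch of that route, assuming the target is an external Parquet table partitioned by a hypothetical load_date column; the HDFS path and partition value below are illustrative placeholders, not from the question:
# Write the DataFrame out as Parquet files under a partition-style path (hypothetical path)
df_log.write.mode("overwrite").parquet("/data/project_alpha/sends_log_test/load_date=2018-01-01")

# Register those files as a new partition of the existing external table
sqlContext.sql("""
    ALTER TABLE project_alpha.sends_log_test
    ADD IF NOT EXISTS PARTITION (load_date='2018-01-01')
    LOCATION '/data/project_alpha/sends_log_test/load_date=2018-01-01'
""")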
Upvotes: 1
Reputation: 108
Are you sure the final tables are cached? The issue might be that Spark computes the whole pipeline before writing the data. You can check that in the terminal/console where Spark runs.
Also, please check that the table you append to in Hive is not a temporary view; in that case the issue could be that the view is recalculated before the new rows are appended. When I write data to Hive I always use:
df.write.saveAsTable('schema.table', mode='overwrite')
Please try:
df.write.saveAsTable('schema.table', mode='append')
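For completeness, a small sketch of combining that append with an explicit materialization step so the pipeline is not recomputed at write time (df and 'schema.table' are placeholders for your own DataFrame and table):
# Materialize the DataFrame once so the write doesn't re-run the whole pipeline
df = df.cache()
df.count()

# Append the materialized result to the Hive table
df.write.saveAsTable('schema.table', mode='append')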
Upvotes: 3