Regressor

Reputation: 1973

How to append new data to an existing Hive table using a Spark DataFrame and partitionBy

I have a DataFrame which I am writing to a Hive table using partitionBy:

val df = Seq(
  ("day1", 1),
  ("day1", 2),
  ("day1", 3),
  ("day1", 4)
).toDF("day", "id")

df.write.partitionBy("day").format("parquet").saveAsTable("mydb.day_table")

spark.sql("select * from mydb.day_table").show
+---+----+
| id| day|
+---+----+
|  1|day1|
|  2|day1|
|  3|day1|
|  4|day1|
+---+----+

spark.sql("show create table mydb.day_table").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `mydb`.`day_table` (`id` INT, `day` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1'
)
PARTITIONED BY (day)
|
+---------------------------------------------------------------------------------------------------------------------------------------+

If I create another DataFrame and want to append its contents to the same table while keeping the partitioning intact, how do I do it? Is this the correct way?

val df2 = Seq(
  ("day2", 5),
  ("day2", 6)
).toDF("day", "id")

df2.write.mode("append").partitionBy("day").format("parquet").saveAsTable("mydb.day_table")

I want the output below, but I also want the table to stay partitioned by day even as I keep appending data to the original table.

spark.sql("select * from mydb.day_table").show(false)
+---+----+
|id |day |
+---+----+
|1  |day1|
|2  |day1|
|3  |day1|
|4  |day1|
|5  |day2|
|6  |day2|
+---+----+
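For reference, one way to check that the table stays partitioned (assuming the spark-shell session has Hive support enabled, as it does by default) is to list the partitions:

spark.sql("show partitions mydb.day_table").show(false)

After both writes, this should list day=day1 and day=day2.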

I am using Scala with Spark-Shell. Thanks.

Upvotes: 0

Views: 4024

Answers (1)

Vijay Krishna

Reputation: 1067

You have to use the Spark DataFrame writer API's:

1) partitionBy

2) saveAsTable

// Append new rows while keeping the table's partition layout.
// (In Scala there is no backslash line continuation; end each
// line with a dot so spark-shell keeps reading the expression.)
df.write.
  format("parquet").
  mode(org.apache.spark.sql.SaveMode.Append).
  partitionBy("first").
  saveAsTable("example")

I see this is what you were doing as well, but you did not post any error.
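Applied to the table from your question, a minimal sketch (reusing your own table and column names) looks like this:

// Append df2 into the existing partitioned table; SaveMode.Append
// adds the new day=day2 partition without rewriting day=day1.
df2.write.
  format("parquet").
  mode(org.apache.spark.sql.SaveMode.Append).
  partitionBy("day").
  saveAsTable("mydb.day_table")

As long as the mode is Append and partitionBy names the column the table was originally partitioned by, Spark keeps the existing partitions and simply adds the new ones.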

Upvotes: 1
