Reputation: 1117
I have a spark dataframe based on which I am trying to create a partitioned table in hive.
I have a flag to say if table exists or not. The first run should create the table and from second run onwards the data should be inserted into the table without overwriting existing data.
My question is how to create a partitioned table and insert into the already existing partitioned table without overriding existing data.
The table is partitioned by a column called date.
What I tried so far.(without partitions)
df.createOrReplaceTempView("df_view")
if table_exists:
spark.sql("insert into mytable select * from df_view")
else:
spark.sql("create table if not exists mytable as select * from df_view")
But I have to do the same with partitioned column - date.
There can be multiple runs for same date. So is it possible to append the data to same partition instead of overwriting it.
Expected output: After 1st run: table should be created with partition column as date.
Name date timestamp
A. 2021-09-16 2021-09-16 12:00:01
B. 2021-09-16 2021-09-16 12:00:01
After second run on same date:(data should get appended to same partition)
Name date timestamp
A. 2021-09-16 2021-09-16 12:00:01
B. 2021-09-16 2021-09-16 12:00:01
A. 2021-09-16 2021-09-16 12:20:01
B. 2021-09-16 2021-09-16 12:20:01
Third run on next date: (new partition should be created by retaining all existing data)
Name date timestamp
A. 2021-09-16 2021-09-16 12:00:01
B. 2021-09-16 2021-09-16 12:00:01
A. 2021-09-16 2021-09-16 12:20:01
B. 2021-09-16 2021-09-16 12:20:01
A. 2021-09-17 2021-09-17 12:20:01
B. 2021-09-17 2021-09-17 12:20:01
How to achieve this in Pyspark.
Upvotes: 1
Views: 1065
Reputation: 4069
Following the documentation your code might be something like this:
df.write.saveAsTable('[table_name_here]',
format='[format_here]',
mode='append',
partitionBy='date')
This code there's no needed to check if table exists, append
automatically creates if not exists.
Upvotes: 0
Reputation: 6082
you can run parts = spark.sql('show partitions mytable')
, even convert it to a Python list or Pandas and check if the partition exists
Upvotes: 0