My agenda -
Important: everything below is executed in a Databricks notebook.
My Hive table reads its partitions from this location on S3 -
Hive location - s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass
containing partitions like -
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000
The write to S3 happens with -
data.coalesce(coalesceSize / maxFilterLength)
  .write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .format(flatDataFormat)
  .option("path", s"${flatteningBasePath}/")
  .save()
And the destination locations are -
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200801230000
.....
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200801230000
....
Now, the issue: all of this completes successfully, but the original data in the Hive partitions, i.e. -
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000
suddenly have NO DATA in them, which is also confirmed when I query the Hive table and it returns null.
Upon looking closely at the S3 folder for one such partition (e.g. s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000), the data files are gone.
I am really confused as to why this is happening; I don't want to lose any data.
This was caused by Spark's partition overwrite behavior. The default value of spark.sql.sources.partitionOverwriteMode is "static", under which an overwrite-mode write to a partitioned table replaces the table's existing partitions under the target path, not just the ones present in the new data. I was able to fix it by setting the mode to "dynamic", so only the partitions actually being written are overwritten:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
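For intuition, here is a minimal plain-Scala sketch of the difference between the two modes. It does not use Spark at all; the object, type, and method names are made up purely to illustrate how each mode treats existing partitions that are absent from the incoming data:

```scala
// Hypothetical model: a "table" is a map from partition directory to rows.
object PartitionOverwriteDemo {
  type Table = Map[String, Seq[Int]]

  // "static" (the default): the existing partitions under the table path
  // are cleared, and only the incoming partitions are written back.
  def staticOverwrite(existing: Table, incoming: Table): Table =
    incoming

  // "dynamic": only partitions present in the incoming data are replaced;
  // every other existing partition is left untouched.
  def dynamicOverwrite(existing: Table, incoming: Table): Table =
    existing ++ incoming

  def main(args: Array[String]): Unit = {
    val existing: Table = Map(
      "ingestiontime=20200801230000" -> Seq(1, 2),
      "ingestiontime=20200802230000" -> Seq(3, 4)
    )
    val incoming: Table = Map(
      "ingestiontime=20200802230000" -> Seq(5)
    )
    // Under "static", the untouched 20200801 partition disappears:
    assert(!staticOverwrite(existing, incoming).contains("ingestiontime=20200801230000"))
    // Under "dynamic", it survives:
    assert(dynamicOverwrite(existing, incoming).contains("ingestiontime=20200801230000"))
    println("ok")
  }
}
```

This mirrors what happened above: the write only produced businessname=.../ingestiontime=... partitions, so under the default static mode the pre-existing ingestiontime=... partitions at the same base path were wiped.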