Golokesh Patra

Reputation: 596

All data deleted from S3 after writing with "overwrite" mode via Databricks

My agenda:

  1. Query a Hive external table and store the data in a DataFrame.
  2. Do some processing on the DataFrame.
  3. Store the processed data at another S3 location (i.e. also partitioning by businessname in the DataFrame write) - see the sketch after this list.

Important - all of the above runs in a Databricks notebook.
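A minimal sketch of that workflow, assuming a hypothetical table name mydb.mydataclass, a placeholder processing step, and an illustrative target path/format (none of these appear in the original post):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// 1. Query the Hive external table into a DataFrame.
//    mydb.mydataclass is a hypothetical table name.
val raw = spark.table("mydb.mydataclass")

// 2. Placeholder processing step; the real logic is not shown in the post.
val processed = raw.withColumn("businessname", lower(col("businessname")))

// 3. Write the result back to S3, partitioned by both columns.
//    The target path and format here are assumptions for illustration.
processed.write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save("s3://dev/gp/data/tmp/g.p/FlattenedData/myDataclass")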

So, my Hive table reads partitions from this location on S3:

Hive location - s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass

containing partitions like:

s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000

The write to S3 happens with:

// coalesceSize, maxFilterLength, flatDataFormat and flatteningBasePath
// are values defined elsewhere in the job.
data.coalesce(coalesceSize / maxFilterLength)
  .write
  .partitionBy("businessname", "ingestiontime") // two-level partitioning
  .mode(SaveMode.Overwrite)                     // overwrite existing data
  .format(flatDataFormat)
  .option("path", s"${flatteningBasePath}/")
  .save()

And the destination locations are:

s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200801230000
.....
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200801230000
....

Now, the issue is that all of this completes successfully, but the original Hive partitions, i.e.

    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000

suddenly have NO DATA in them, which is also confirmed when I query the Hive table and it returns nothing.

Looking closely at one of these S3 folders (say this partition - s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000):

[screenshot of the partition folder's contents]
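(The folder contents can be inspected directly from the notebook; a small sketch using Databricks' standard dbutils utility and the partition path from above:)

// List the contents of one partition folder from the Databricks notebook.
display(dbutils.fs.ls(
  "s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000"))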

I am really confused as to why this is happening; I don't want to lose any data.

Upvotes: 0

Views: 1059

Answers (1)

Golokesh Patra

Reputation: 596

This was caused by Spark's partition overwrite behaviour: with the default setting (spark.sql.sources.partitionOverwriteMode=static), an overwrite deletes all existing partitions under the target path before writing. I was able to control this by setting the property to dynamic:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
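Setting this before the write means only the partitions actually present in the incoming DataFrame get replaced, while every other partition under the path is left intact. A sketch combining it with the write from the question (the behaviour described in the comment is standard Spark since 2.3):

// With dynamic partition overwrite, SaveMode.Overwrite replaces only the
// (businessname, ingestiontime) partitions present in `data`; all other
// partitions under the target path are preserved.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

data.coalesce(coalesceSize / maxFilterLength)
  .write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .format(flatDataFormat)
  .option("path", s"${flatteningBasePath}/")
  .save()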

Upvotes: 2
