Golokesh Patra

Reputation: 596

All data deleted from S3 after writing with "overwrite" mode via Databricks

My agenda:

  1. Query a Hive external table and store the data in a DataFrame.
  2. Do some processing on the DataFrame.
  3. Store the processed data at another S3 location (i.e. also partitioning by businessname in the DataFrame write) - see the sketch after this list.

Important - all of the above runs in a Databricks notebook.
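A minimal sketch of that workflow, assuming a hypothetical table name mydb.mydataclass, a placeholder processing step, and an illustrative target path/format (none of these appear in the original post):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// 1. Query the Hive external table into a DataFrame.
//    mydb.mydataclass is a hypothetical table name.
val raw = spark.table("mydb.mydataclass")

// 2. Placeholder processing step; the real logic is not shown in the post.
val processed = raw.withColumn("businessname", lower(col("businessname")))

// 3. Write the result back to S3, partitioned by both columns.
//    The target path and format here are assumptions for illustration.
processed.write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save("s3://dev/gp/data/tmp/g.p/FlattenedData/myDataclass")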

So, my Hive table reads partitions from this location on S3:

Hive location - s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass

containing partitions like:

s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000

The write to S3 happens with:

// coalesceSize, maxFilterLength, flatDataFormat and flatteningBasePath
// are values defined elsewhere in the job.
data.coalesce(coalesceSize / maxFilterLength)
  .write
  .partitionBy("businessname", "ingestiontime") // two-level partitioning
  .mode(SaveMode.Overwrite)                     // overwrite existing data
  .format(flatDataFormat)
  .option("path", s"${flatteningBasePath}/")
  .save()

And the destination locations are:

s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=abc/ingestiontime=20200801230000
.....
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200804230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200803230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200802230000
s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/businessname=hell/ingestiontime=20200801230000
....

Now, the issue is that all of this completes successfully, but the original Hive partitions, i.e.

    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200804230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200803230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200802230000
    s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000

suddenly have NO DATA in them, which is also confirmed when I query the Hive table and it returns nothing.

Looking closely at one of these S3 folders (say this partition - s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000):

[screenshot of the partition folder's contents]
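(The folder contents can be inspected directly from the notebook; a small sketch using Databricks' standard dbutils utility and the partition path from above:)

// List the contents of one partition folder from the Databricks notebook.
display(dbutils.fs.ls(
  "s3://dev/gp/data/tmp/g.p/HistoryData/myDataclass/ingestiontime=20200801230000"))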

I am really confused as to why this is happening; I don't want to lose any data.

Upvotes: 0

Views: 1059

Answers (1)

Golokesh Patra

Reputation: 596

This was caused by Spark's partition overwrite behaviour: with the default setting (spark.sql.sources.partitionOverwriteMode=static), an overwrite deletes all existing partitions under the target path before writing. I was able to control this by setting the property to dynamic:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
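Setting this before the write means only the partitions actually present in the incoming DataFrame get replaced, while every other partition under the path is left intact. A sketch combining it with the write from the question (the behaviour described in the comment is standard Spark since 2.3):

// With dynamic partition overwrite, SaveMode.Overwrite replaces only the
// (businessname, ingestiontime) partitions present in `data`; all other
// partitions under the target path are preserved.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

data.coalesce(coalesceSize / maxFilterLength)
  .write
  .partitionBy("businessname", "ingestiontime")
  .mode(SaveMode.Overwrite)
  .format(flatDataFormat)
  .option("path", s"${flatteningBasePath}/")
  .save()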

Upvotes: 2
