Reputation: 71
I am using AWS Wrangler to read some data in and then check that my new data doesn't contain timestamps that already exist in the original data (this can happen if the pipeline runs twice in a day, since I want each row to align to a specific day). If there is a conflict, I drop the conflicting rows from the original, concat the new data onto it, and overwrite the whole dataset. I am wondering if there is a better way. Can I still just append the data? Curious to hear opinions.
# pseudo code, using awswrangler's s3.read_parquet / s3.to_parquet
import awswrangler as wr
import pandas as pd

original_data = wr.s3.read_parquet(path)  # existing dataset at `path`
new_data = func()                         # freshly generated rows
# count rows in new_data whose timestamps already exist in the original
conflicts = new_data["timestamp"].isin(original_data["timestamp"]).sum()
if conflicts == 0:
    wr.s3.to_parquet(new_data, path, dataset=True, mode="append")
else:
    kept = original_data[~original_data["timestamp"].isin(new_data["timestamp"])]
    wr.s3.to_parquet(pd.concat([kept, new_data]), path, dataset=True, mode="overwrite")
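If it helps frame answers: one alternative I can imagine is partitioning the dataset by day, so a re-run only replaces that day's partition instead of doing the read/check/overwrite above. A minimal sketch of what I mean, where the `day` column and `path` are just placeholder names of mine:

import awswrangler as wr

# derive a day partition key from the timestamp, then let a re-run
# replace only the partitions present in new_data
new_data["day"] = new_data["timestamp"].dt.date.astype(str)
wr.s3.to_parquet(
    new_data,
    path,
    dataset=True,
    partition_cols=["day"],
    mode="overwrite_partitions",  # replaces only the partitions being written
)

Would that be preferable to the conditional append/overwrite above?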
Upvotes: 0
Views: 38