Best practices to prevent duplicate data (pandas/postgres)

Question

I am trying to prevent duplicate data from being added to a database based on three columns. In pandas, this is simple because I can do drop_duplicates(subset = ['customer_id', 'call_time', 'employee_id'], inplace = True, keep = 'last']

I recently installed postgresql and am loading data with SQLAlchemy with to_sql.

Is it logical to read from my database at the start of the file, find the maximum date that exists in the file, and then only read/modify new data which occurs after that date? Say the max date in the file is 12/04, then I would add files which are 12/05 onward (max_date + pd.Timedelta(days = 1)) only to make sure I am not adding duplicate information.

My current setup, since I am only using .csv files, is to always read the full month (current month) of data to be safe. Then I overwrite a month-to-date csv file over and over again every day, adding one new day of data each time.

start_date = (central_time - pd.Timedelta(days = 0))- pd.offsets.MonthBegin()
end_date = central_time
days = ((pd.Timestamp(end_date) - pd.Timestamp(start_date)) / np.timedelta64(1, 'D')).astype(int)

Lukasz Szozda · Accepted Answer

You can add UNIQUE constraint to your colums like:

CREATE TABLE tab(id SERIAL PRIMARY KEY,
                 customer_id INT,
                 call_time TIMESTAMP,
                 employee_id INT,
                 note VARCHAR(100),
                 CONSTRAINT uc_tab UNIQUE (customer_id, call_time, employee_id)
            );

And use INSERT INTO ... ON CONFLICT syntax instead:

The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error. For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken. ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative action. ON CONFLICT DO UPDATE updates the existing row that conflicts with the row proposed for insertion as its alternative action.

INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note1');


INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note2');
-- ERROR: duplicate key value violates unique constraint "uc_tab" Detail: 
-- Key (customer_id, call_time, employee_id)=(1, 2015-12-01 00:00:00, 1)
-- already exists.

INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note2') 
ON CONFLICT ON CONSTRAINT uc_tab DO UPDATE SET note = EXCLUDED.note;

Best practices to prevent duplicate data (pandas/postgres)

Answers (1)

Related Questions