Reputation: 5355
I am trying to prevent duplicate data from being added to a database based on three columns. In pandas, this is simple because I can do drop_duplicates(subset = ['customer_id', 'call_time', 'employee_id'], inplace = True, keep = 'last']
I recently installed postgresql and am loading data with SQLAlchemy with to_sql
.
Is it logical to read from my database at the start of the file, find the maximum date that exists in the file, and then only read/modify new data which occurs after that date? Say the max date in the file is 12/04, then I would add files which are 12/05 onward (max_date + pd.Timedelta(days = 1))
only to make sure I am not adding duplicate information.
My current setup, since I am only using .csv files, is to always read the full month (current month) of data to be safe. Then I overwrite a month-to-date csv file over and over again every day, adding one new day of data each time.
start_date = (central_time - pd.Timedelta(days = 0))- pd.offsets.MonthBegin()
end_date = central_time
days = ((pd.Timestamp(end_date) - pd.Timestamp(start_date)) / np.timedelta64(1, 'D')).astype(int)
Upvotes: 0
Views: 2172
Reputation: 175596
You can add UNIQUE
constraint to your colums like:
CREATE TABLE tab(id SERIAL PRIMARY KEY,
customer_id INT,
call_time TIMESTAMP,
employee_id INT,
note VARCHAR(100),
CONSTRAINT uc_tab UNIQUE (customer_id, call_time, employee_id)
);
And use INSERT INTO ... ON CONFLICT syntax
instead:
The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error. For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken. ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative action. ON CONFLICT DO UPDATE updates the existing row that conflicts with the row proposed for insertion as its alternative action.
INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note1');
INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note2');
-- ERROR: duplicate key value violates unique constraint "uc_tab" Detail:
-- Key (customer_id, call_time, employee_id)=(1, 2015-12-01 00:00:00, 1)
-- already exists.
INSERT INTO tab(customer_id, call_time, employee_id, note)
VALUES (1, TO_TIMESTAMP('2015-12-01','YYYY-MM-DD'), 1, 'Note2')
ON CONFLICT ON CONSTRAINT uc_tab DO UPDATE SET note = EXCLUDED.note;
Upvotes: 3