Juan David

Reputation: 2797

Optimal approach to bulk insert of pandas dataframe into PostgreSQL table

I need to upload multiple Excel files to a PostgreSQL table, but they can overlap each other in several records, so I need to watch out for IntegrityErrors. I'm following two approaches:

cursor.copy_from: the fastest approach, but I don't know how to catch and handle all the IntegrityErrors raised by duplicate records

from io import StringIO

# Serialize the dataframe into an in-memory CSV buffer
streamCSV = StringIO()
invoicing_info.to_csv(streamCSV, index=False, header=False, sep=';')
streamCSV.seek(0)

with conn.cursor() as c:
    # COPY the whole buffer into the staging table in a single round trip
    c.copy_from(streamCSV, "staging.table_name", columns=list(invoicing_info.columns), sep=';')
    conn.commit()

cursor.execute: I can count and handle each exception, but it is very slow.

data = invoicing_info.to_dict(orient='records')

successful_inserts = 0
duplicate_registers = 0

with connection.cursor() as c:
    for entry in data:
        try:
            c.execute(DLL_INSERT, entry)
            successful_inserts += 1
            connection.commit()
            print('Successful insert. Operation number {}'.format(successful_inserts))
        except psycopg2.IntegrityError:
            # Roll back the failed statement so the connection stays usable
            duplicate_registers += 1
            connection.rollback()
            print('Duplicate entry. Operation number {}'.format(duplicate_registers))

At the end of the routine, I need to determine the following info:

print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))

How can I modify the first approach to control all exceptions? How can I optimize the second approach?

Upvotes: 2

Views: 1312

Answers (2)

Andrew Pederson

Reputation: 167

If the Excel sheets contain duplicate records, pandas seems a likely choice for identifying and eliminating dupes: https://33sticks.com/python-for-business-identifying-duplicate-data/. Or is the issue that different records in different sheets share the same id/index? If so, a similar approach could work: use pandas to isolate the ids that occur multiple times, then correct them with unique identifiers before attempting to upload to the SQL db.
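
For example, a minimal sketch of that with pandas (the file names and the 'id' column are assumptions):

import pandas as pd

# Hypothetical input files; combine all sheets into one frame
frames = [pd.read_excel(f) for f in ("sheet1.xlsx", "sheet2.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Rows whose id occurs more than once, for inspection or correction
dupes = combined[combined.duplicated(subset="id", keep=False)]

# Or simply keep the first occurrence of each id
deduped = combined.drop_duplicates(subset="id", keep="first")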

For a bulk upload, I'd use an ORM. SQLAlchemy has some great info on bulk uploads: http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#bulk-operations, and there's a related discussion here: Bulk insert with SQLAlchemy ORM
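
A rough sketch of such a bulk upload, here via pandas' SQLAlchemy integration rather than the ORM bulk API (the connection URL, file name, and table names are assumptions):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/dbname")  # hypothetical DSN

deduped = pd.read_excel("sheet1.xlsx").drop_duplicates(subset="id")  # hypothetical input

# to_sql hands the frame to SQLAlchemy; method="multi" batches rows
# into multi-row INSERT statements, chunksize rows at a time
deduped.to_sql("table_name", engine, schema="staging",
               if_exists="append", index=False, method="multi", chunksize=1000)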

Upvotes: 0

Ilya Dyoshin

Reputation: 4624

Since you have duplicate IDs in different Excel sheets, you first have to decide for yourself which Excel sheet's data you trust.

Since you are using multiple tables anyway, and you want to keep at least one row from each conflicting pair, you can always do the following:

  • create a temporary table for each Excel sheet
  • upload the data for each Excel sheet into its table (in bulk, like you do now)
  • make an insert from a select picking DISTINCT ON (id), in this manner:

INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id) 
     id, col1, col2
FROM 
(
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet1
    UNION
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet2
    UNION
    SELECT id, col1, col2 ... 
       FROM staging.temp_table_for_excel_sheet3
) as data

With such an insert, PostgreSQL will keep an arbitrary row out of each set of rows sharing a non-unique id.
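
Steps 1 and 2 could look roughly like this with psycopg2 (a sketch; the connection string, input file, and LIKE-based table creation are assumptions):

from io import StringIO

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN

sheet1 = pd.read_excel("sheet1.xlsx")  # hypothetical input file

with conn.cursor() as c:
    # Step 1: a scratch table per sheet, cloned from the target table's structure
    c.execute("CREATE TABLE staging.temp_table_for_excel_sheet1 (LIKE staging.table_name)")

    # Step 2: bulk-load the sheet through COPY, as in the question
    buf = StringIO()
    sheet1.to_csv(buf, index=False, header=False, sep=';')
    buf.seek(0)
    c.copy_from(buf, "staging.temp_table_for_excel_sheet1", sep=';')

conn.commit()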

In case you would like to trust the first record, you can add some ordering:

INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
     id, col1, col2
FROM
(
    SELECT id, 1 as ordering_column, col1, col2 ...
       FROM staging.temp_table_for_excel_sheet1
    UNION
    SELECT id, 2 as ordering_column, col1, col2 ...
       FROM staging.temp_table_for_excel_sheet2
    UNION
    SELECT id, 3 as ordering_column, col1, col2 ...
       FROM staging.temp_table_for_excel_sheet3
) as data
ORDER BY id, ordering_column

(With DISTINCT ON(id), the ORDER BY has to start with id; ordering_column then breaks ties in favor of the first sheet.)

For the initial count of objects (UNION ALL, not UNION, so that sheets that happen to have equal counts are not collapsed into one row):

SELECT sum(count)
FROM
(
  SELECT count(*) as count FROM temp_table_for_excel_sheet1
  UNION ALL
  SELECT count(*) as count FROM temp_table_for_excel_sheet2
  UNION ALL
  SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data

After finishing these bulk inserts, you can run SELECT count(*) FROM staging.table_name to get the total number of inserted records.

For the duplicate count you can run:

SELECT sum(count)
FROM
(
  SELECT count(*) as count
  FROM temp_table_for_excel_sheet2 WHERE id IN (SELECT id FROM temp_table_for_excel_sheet1)

  UNION ALL

  SELECT count(*) as count
  FROM temp_table_for_excel_sheet3 WHERE id IN (SELECT id FROM temp_table_for_excel_sheet1)

  UNION ALL

  SELECT count(*) as count
  FROM temp_table_for_excel_sheet3 WHERE id IN (SELECT id FROM temp_table_for_excel_sheet2)
) as data

Upvotes: 3
