Aditya Mertia

Reputation: 515

pandas to_sql insert ignore

I want to incrementally add DataFrame rows to a MySQL database while preventing any duplicate entries from being inserted.

I am currently doing this by looping through every row with df.apply() and calling MySQL's INSERT IGNORE to add only the unique rows to the database. But pandas.apply is very slow (~45 secs for 10k rows). I would like to use the pandas.to_sql() method instead, which takes ~0.5 secs to push 10k entries into the DB, but it doesn't support ignoring duplicates in append mode. Is there an efficient and fast way to achieve this?

Input CSV

Date,Open,High,Low,Close,Volume
1994-01-03,111.7,112.75,111.55,112.65,0
1994-01-04,112.68,113.47,112.2,112.65,0
1994-01-05,112.6,113.63,112.3,113.0,0
1994-01-06,113.02,113.43,112.25,112.62,0
1994-01-07,112.55,112.8,111.5,111.88,0
1994-01-10,111.8,112.43,111.35,112.25,0
1994-01-11,112.18,112.88,112.05,112.4,0
1994-01-12,112.38,112.82,111.95,112.28,0
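
For reference, nifty_data below is just this CSV read into a DataFrame (the filename is a placeholder):

import pandas as pd

nifty_data = pd.read_csv('nifty.csv', parse_dates=['Date'])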

Code

def addToDb(row):
    sql = ("INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) "
           "VALUES (%s, %s, %s, %s, %s, %s)")
    val = (row['Date'], row['Open'], row['High'], row['Low'], row['Close'], row['Volume'])
    mycursor.execute(sql, val)
    mydb.commit()

nifty_data.to_sql(name='eod_data', con=engine, if_exists='append', index=False)  # option-1
nifty_data.apply(addToDb, axis=1)  # option-2

option-1: fast (~0.5 secs) but doesn't allow INSERT IGNORE

option-2: allows INSERT IGNORE but has to loop through every row and is very slow (~45 secs)
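
For reference, option-2 can also be batched into a single executemany() call with one commit, which removes the per-row round trips; a sketch using the same mycursor/mydb objects as above:

# Batch all rows into one INSERT IGNORE via executemany().
sql = ("INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) "
       "VALUES (%s, %s, %s, %s, %s, %s)")
vals = list(nifty_data[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
            .itertuples(index=False, name=None))
mycursor.executemany(sql, vals)
mydb.commit()

This is much faster than committing per row, though it still bypasses to_sql().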

Upvotes: 13

Views: 9902

Answers (1)

pedrogfp

Reputation: 589

You can create a temporary table:

nifty_data.to_sql(name='temporary_table', con=engine, if_exists='replace', index=False)  # 'replace' keeps the staging table fresh on each run

And then run an INSERT IGNORE statement from that:

from sqlalchemy import text  # needed to execute a raw SQL string on SQLAlchemy 1.4+/2.x

with engine.begin() as cnx:
    insert_sql = text('INSERT IGNORE INTO eod_data SELECT * FROM temporary_table')
    cnx.execute(insert_sql)

Just make sure the column order is the same in both tables, or declare the columns explicitly in both the INSERT and the SELECT.
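
If you'd rather skip the staging table entirely, pandas 0.24+ also accepts a custom method callable in to_sql(), so each chunk is written with INSERT IGNORE directly. A minimal sketch, assuming a MySQL backend and the same engine as above; insert_ignore is a name I made up:

from sqlalchemy import text

def insert_ignore(table, conn, keys, data_iter):
    # Called by to_sql() for each chunk: issue one multi-row
    # INSERT IGNORE instead of the default plain INSERT.
    cols = ", ".join(f"`{k}`" for k in keys)
    params = ", ".join(f":{k}" for k in keys)
    stmt = text(f"INSERT IGNORE INTO `{table.name}` ({cols}) VALUES ({params})")
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

nifty_data.to_sql(name='eod_data', con=engine, if_exists='append',
                  index=False, method=insert_ignore)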

Upvotes: 19
