Nishad Nazar

Reputation: 369

CSV to MSSQL using pymssql

The goal is to continuously watch my CSV for new records and insert them into MS SQL Server using the pymssql library. The CSV initially has 244 rows; I then add one new row and want the script, which a scheduler runs every 15 seconds, to insert only that new row. The first run inserts everything fine, but every subsequent run throws 'Cannot insert duplicate key in object', because my first column, DateID, is the primary key. The statement terminates on the very first record, so the new row is never inserted.

How do I work around this?

Code:

import datetime
import time

import schedule

def trial():
    try:
        for row in df.itertuples():
            # parse the CSV date string into a datetime for the DATETIME column
            datevalue = datetime.datetime.strptime(row.OrderDate, format)

            query = "INSERT INTO data (OrderDate, Region, City, Category) VALUES (%s, %s, %s, %s)"
            cursor.execute(query, (datevalue, row.Region, row.City, row.Category))
        print("Values inserted")
        conn.commit()
        conn.close()
    except Exception as e:
        print("Handle error", e)


schedule.every(15).seconds.do(trial)

# keep the scheduler alive (jobs only fire inside run_pending)
while True:
    schedule.run_pending()
    time.sleep(1)
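The snippet assumes df, conn, cursor, and format are defined elsewhere; a minimal sketch of that setup, where the CSV path, date format, and connection details are placeholders:

import pandas as pd
import pymssql

df = pd.read_csv("data.csv")          # placeholder CSV path
format = "%m/%d/%Y"                   # assumed date format of the OrderDate strings
conn = pymssql.connect(server="localhost", user="username",
                       password="password", database="mydb")  # placeholder credentials
cursor = conn.cursor()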

Library used: pymssql. Database: Microsoft SQL Server 2019.

Upvotes: 0

Views: 846

Answers (1)

Parfait

Reputation: 107652

To avoid duplicate values, consider adjusting the query to use the EXCEPT clause (part of the UNION and INTERSECT family of set operators) against the actual table data. Also, consider executemany, passing a nested list of all row/column data built with DataFrame.to_numpy().tolist().

By the way, if the OrderDate column is a datetime type in both the data frame and the database table, you do not need to re-format it to a string value.
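For example, the column can be parsed once when the CSV is loaded (a minimal sketch; orders.csv is a hypothetical file name):

import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["OrderDate"])  # OrderDate becomes datetime64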

def trial():
    try:
        # INSERT ... SELECT ... EXCEPT skips any row already present in data
        query = (
            "INSERT INTO data (OrderDate, Region, City, Category) "
            "SELECT %s, %s, %s, %s "
            "EXCEPT "
            "SELECT OrderDate, Region, City, Category "
            "FROM data"
        )

        # build a nested sequence of row tuples for executemany
        vals = df[["OrderDate", "Region", "City", "Category"]].to_numpy()
        vals = tuple(map(tuple, vals))
        cur.executemany(query, vals)

        print("Values inserted")
        conn.commit()

    except Exception as e:
        print("Handle error", e)

    finally:
        cur.close()
        conn.close()
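The tuple(map(tuple, vals)) step just builds the nested row sequence that executemany expects; the DataFrame.to_numpy().tolist() call mentioned above is an equivalent one-liner:

vals = df[["OrderDate", "Region", "City", "Category"]].to_numpy().tolist()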

For a faster bulk insert, consider staging the data through a temp table:

# CREATE EMPTY TEMP TABLE WITH SAME STRUCTURE AS TARGET
query = "SELECT TOP 0 OrderDate, Region, City, Category INTO #pydata FROM data"
cur.execute(query)

# INSERT INTO TEMP TABLE
query = (
    "INSERT INTO #pydata (OrderDate, Region, City, Category) "
    "VALUES (%s, %s, %s, %s)"
)
vals = df[["OrderDate", "Region", "City", "Category"]].to_numpy()
vals = tuple(map(tuple, vals))
cur.execute("BEGIN TRAN")
cur.executemany(query, vals)

# MIGRATE ONLY NEW ROWS TO FINAL TABLE
query = (
    "INSERT INTO data (OrderDate, Region, City, Category) "
    "SELECT OrderDate, Region, City, Category "
    "FROM #pydata "
    "EXCEPT "
    "SELECT OrderDate, Region, City, Category "
    "FROM data"
)
cur.execute(query)
conn.commit()
print("Values inserted")

Upvotes: 2
