Ayam
Ayam

Reputation: 169

Executing an SQL update statement from a pandas dataframe

Context: I am using MSSQL, pandas, and pyodbc.

Steps:

Now how do I execute the sql code in the auxilliary column, without looping through each row?

sample data

The first two columns are obtained by querying dbo.table, the third columns exists but is empty in the database. The fourth column only exists in the dataframe to prepare the SQL statement that would correspond to updating dbo.table

ID raw processed strSQL
1 [email protected] lorum ipsum UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2 [email protected] rumlo sumip UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3 ... ... ...

I would like to execute the SQL script in each row in an efficient manner.

Upvotes: 1

Views: 5344

Answers (1)

Gord Thompson
Gord Thompson

Reputation: 123409

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.

For an existing table named MillionRows

ID  TextField
--  ---------
 1  foo
 2  bar
 3  baz
…

and example data of the form

num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]

my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

took about 180 seconds (3 minutes).

However, by creating a user-defined table type

CREATE TYPE dbo.TextField_ID AS TABLE 
(
    TextField nvarchar(255) NULL, 
    ID int NOT NULL, 
    PRIMARY KEY (ID)
)

and a stored procedure

CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END

when I used

crsr.execute("{CALL mr_update (?)}", (rows,))

it did the same update in approximately 80 seconds (less than half the time).

Upvotes: 3

Related Questions