Reputation: 13
I am stuck on an issue at the moment and hoping someone may be able to enlighten me.
My department has a few pandas scripts that take data from a CSV, do minor transformations, and then write the result to our SQL Server database using SQLAlchemy (dataframe.to_sql). The connection for this is set up with the fast_executemany argument. The dataframe after processing is only about 50K rows and ~20 columns, and it takes about 2-3 minutes to write to the SQL database.
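For reference, the write step looks roughly like this (a minimal sketch; the connection string, file name, and table name are placeholders):

```python
import pandas as pd
import sqlalchemy as sa

# fast_executemany makes pyodbc batch the parameterized INSERTs
# instead of sending them one row at a time
engine = sa.create_engine(
    "mssql+pyodbc://user:password@my_dsn",  # placeholder connection string
    fast_executemany=True,
)

df = pd.read_csv("input.csv")  # placeholder file
# ... minor transformations ...
df.to_sql("my_table", engine, if_exists="append", index=False)
```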
I recently started exploring replacing the pandas portion of this script with Polars, to match some of our other processes. The transformation time drops drastically after the switch, and once the transformations are done in Polars I convert the output frame to a pandas DataFrame with .to_pandas().
My issue comes when I try to write this converted frame to SQL Server using exactly the same method as for the original pandas dataframe: it now takes about 25-30 minutes to write the same amount of data.
I can't seem to figure out why. My hunch, from comparing the SQL tables produced both ways, is that the Polars-derived frame is being written to SQL in a random order. A "RecordId" is automatically assigned to each row as it's inserted, and the pandas dataframe is inserted sequentially. With the Polars version, the RecordId field goes "1, 10, 5, 19, ..., 5000, 8", etc.
Is there something else I'm missing, or any reason for this behaviour?
Upvotes: 0
Views: 1158
Reputation: 18691
I'm guessing you have a groupby somewhere in there. If you don't tell it maintain_order=True, it doesn't keep things in order. If you don't have a groupby, you can issue a sort before to_pandas. See the sketch below.
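Something like this (an untested sketch; "key" and "value" are hypothetical column names, and older Polars versions spell the method groupby rather than group_by):

```python
import polars as pl

# placeholder frame -- "key" and "value" stand in for your columns
df = pl.DataFrame({"key": [2, 1, 2, 1], "value": [10, 20, 30, 40]})

# Option 1: keep rows in their first-seen order through the aggregation
out = df.group_by("key", maintain_order=True).agg(pl.col("value").sum())

# Option 2: if there's no groupby, or order is already lost, sort
# explicitly on whatever column defines the intended insert order
out = out.sort("key")

pandas_df = out.to_pandas()  # hand off to pandas for to_sql
```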
Upvotes: 0