Fetching Million records from SQL server and saving to pandas dataframe

Question

I am trying to fetch data from SQL server Database (Just a simple SELECT * query).

The table contains around 3-5 Million records. Perfomring a SELECT * on the SQL server directly using SSMS takes around 11-15 minutes.

However, when I am connecting via Python and trying to save data into a pandas dataframe, it takes forever. More than 1 hour.

Here is the code I am using:

import pymssql
import pandas as pd

startTime = datetime.now()

## instance a python db connection object- same form as psycopg2/python-mysql drivers also

conn = pymssql.connect(server=r"xyz", database = "abc", user="user",password="pwd")  
print ('Connecting to DB: ',datetime.now() - startTime )


stmt = "SELECT * FROM BIG_TABLE;"
# Excute Query here
df_big_table = pd.read_sql(stmt,conn)

There must be a way to do this in a better way? Perhaps parallel processing or something to fetch the data quickly.

My end goal is to Migrate this table from SQL server to PostGres.

This is the way I am doing:

Fetch data from SQL server using python
Save it to a pandas dataframe
Save this data in CSV to disk.
Copy the CSV from disk to Postgres.

Proably, I can combine step 3,4 so that I can do the transition in memory, rather than using disk IO.

There are many complexity like table constrains and definitions, etc. Which I will be taking care later on. I cannot use a third party tool.

I am stuck at Step 1,2. So help with the Python script/ Some other opensource language would be really appreciated.

If there is any other way to reach to my end goal, I welcome sugessions!

Fetching Million records from SQL server and saving to pandas dataframe

Answers (1)

Related Questions