Anant Vaibhav

Reputation: 365

Fetching millions of records from SQL Server and saving to a pandas DataFrame

I am trying to fetch data from a SQL Server database (just a simple SELECT * query).

The table contains around 3-5 million records. Performing a SELECT * directly on the SQL Server using SSMS takes around 11-15 minutes.

However, when I connect via Python and try to save the data into a pandas DataFrame, it takes forever: more than 1 hour.

Here is the code I am using:

from datetime import datetime

import pymssql
import pandas as pd

startTime = datetime.now()

# Instantiate a Python DB connection object - same form as the psycopg2/python-mysql drivers
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)

stmt = "SELECT * FROM BIG_TABLE;"
# Execute the query and load everything into a single DataFrame
df_big_table = pd.read_sql(stmt, conn)

There must be a better way to do this, perhaps parallel processing or something similar to fetch the data more quickly.

My end goal is to migrate this table from SQL Server to Postgres.

This is the way I am doing it:

  1. Fetch the data from SQL Server using Python.
  2. Save it to a pandas DataFrame.
  3. Save this data as a CSV to disk.
  4. Copy the CSV from disk to Postgres.

Probably, I can combine steps 3 and 4 so that the transfer happens in memory rather than going through disk I/O.
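For the in-memory transfer, something like the following untested sketch is what I have in mind: read the source table in chunks and stream each chunk into Postgres with COPY from an in-memory buffer. It assumes psycopg2 is installed, the target table already exists with a compatible column layout, and all connection details and table names are placeholders.

    import io

    import pymssql
    import psycopg2
    import pandas as pd

    # Placeholder connection details for both databases
    src_conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
    dst_conn = psycopg2.connect(host="pg_host", dbname="pg_db", user="pg_user", password="pg_pwd")

    with dst_conn, dst_conn.cursor() as cur:
        # Pull the source table in chunks instead of one giant DataFrame
        for chunk in pd.read_sql("SELECT * FROM BIG_TABLE;", src_conn, chunksize=50_000):
            buf = io.StringIO()
            chunk.to_csv(buf, index=False, header=False)  # CSV stays in memory
            buf.seek(0)
            # Stream the buffer into the (assumed pre-created) Postgres table
            cur.copy_expert("COPY big_table FROM STDIN WITH (FORMAT csv)", buf)

    src_conn.close()
    dst_conn.close()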

There are many complexities like table constraints and definitions, etc., which I will take care of later on. I cannot use a third-party tool.

I am stuck at steps 1 and 2, so help with a Python script or some other open-source language would be really appreciated.

If there is any other way to reach my end goal, I welcome suggestions!

Upvotes: 2

Views: 2717

Answers (1)

Betjens

Reputation: 1401

Have you tried using the chunksize option of pandas.read_sql? You can still collect everything into a single DataFrame and produce the CSV from that.

If that takes too much memory, use pandas.read_sql as an iterator and write each chunk to its own file; then, after you have done your work, consolidate those files into a single one and submit it to Postgres.
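Something along these lines (an untested sketch; the chunk size, connection details and output path are placeholders to adjust for your setup):

    import pymssql
    import pandas as pd

    conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")

    # read_sql with chunksize returns an iterator of DataFrames instead of one big one
    chunks = pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=100_000)
    for i, chunk in enumerate(chunks):
        # Write the header only for the first chunk, append the rest
        chunk.to_csv("big_table.csv",
                     mode="w" if i == 0 else "a",
                     header=(i == 0),
                     index=False)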

Upvotes: 2
