Reputation: 365
I am trying to fetch data from SQL server Database (Just a simple SELECT * query).
The table contains around 3-5 Million records. Perfomring a SELECT * on the SQL server directly using SSMS takes around 11-15 minutes.
However, when I am connecting via Python and trying to save data into a pandas dataframe, it takes forever. More than 1 hour.
Here is the code I am using:
import pymssql
import pandas as pd
startTime = datetime.now()
## instance a python db connection object- same form as psycopg2/python-mysql drivers also
conn = pymssql.connect(server=r"xyz", database = "abc", user="user",password="pwd")
print ('Connecting to DB: ',datetime.now() - startTime )
stmt = "SELECT * FROM BIG_TABLE;"
# Excute Query here
df_big_table = pd.read_sql(stmt,conn)
There must be a way to do this in a better way? Perhaps parallel processing or something to fetch the data quickly.
My end goal is to Migrate this table from SQL server to PostGres.
This is the way I am doing:
Proably, I can combine step 3,4 so that I can do the transition in memory, rather than using disk IO.
There are many complexity like table constrains and definitions, etc. Which I will be taking care later on. I cannot use a third party tool.
I am stuck at Step 1,2. So help with the Python script/ Some other opensource language would be really appreciated.
If there is any other way to reach to my end goal, I welcome sugessions!
Upvotes: 2
Views: 2717
Reputation: 1401
Have you tried using 'chunksize' option of pandas.read_sql? you can get all of that into a single dataframe and produce the csv.
If it takes more time then you can split each chunk into multiple files using the pandas.read_sql as a iterator and then after you did your work consolidate those files into a single one and submit it to postgres.
Upvotes: 2