Reputation: 3627
I am using pandas.read_sql_query to read some data from SQL Server. The data types I read are int and real in SQL Server, that is, 32-bit integers and 32-bit floating-point values. But in the resulting DataFrame the dtypes are int64 and float64. I could convert back to 32-bit after getting the data into the pandas DataFrame, but I am reading a large amount of data, so that would cause memory problems. Is there a way to keep the data from SQL Server in 32-bit when reading it into the DataFrame?
Here is an example of the db connection:
import pandas as pd
import pyodbc

# server, database, username and password are defined elsewhere
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE='
                      + database + ';UID=' + username + ';PWD=' + password)
query = ("SELECT "
         "MId, "
         "SomeInt, "
         "SomeReal "
         "FROM dbo.Example;")
df = pd.read_sql_query(query, cnxn)
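For what it's worth, newer pandas releases (1.3 and later) also expose a dtype parameter on read_sql_query; as far as I can tell the cast is applied to each chunk after it is fetched, so pairing it with chunksize should keep peak memory bounded. A minimal sketch, assuming pandas >= 1.3 and the query and connection above:

# Sketch assuming pandas >= 1.3, where read_sql_query accepts a dtype mapping.
# Each chunk is downcast to 32-bit as it arrives, then the pieces are combined.
chunks = pd.read_sql_query(
    query,
    cnxn,
    chunksize=100_000,  # fetch in batches instead of all at once
    dtype={'MId': 'int32', 'SomeInt': 'int32', 'SomeReal': 'float32'},
)
df = pd.concat(chunks, ignore_index=True)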
Upvotes: 3
Views: 497
Reputation: 1511
Looking at the current pandas code, I believe the only option is to develop your own solution using a loop approach. "read_sql_query" and "read_sql" both ultimately end up calling "pandas.DataFrame.from_records" on a list of pyodbc rows, and "DataFrame.from_records" doesn't accept per-column data types.
"pandas.DataFrame" likely won't help either, as it only accepts a single data type for the whole frame. Instead you could use numpy to help build the DataFrame in a loop:
import pandas as pd
import numpy as np
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{' + driver + '}', server=server_name,
                            database=database_name, trusted_connection='yes')
cursor = connection.cursor()
# sample MSSQL AdventureWorks database
results = cursor.execute('SELECT BusinessEntityID, NationalIDNumber FROM HumanResources.Employee')
# preallocate empty DataFrame with known data types
data = pd.DataFrame(columns=['BusinessEntityID','NationalIDNumber'])
data = data.astype({'BusinessEntityID': 'int32', 'NationalIDNumber': 'object'})
while True:
    # load results into memory until there are no more to process;
    # this is ultimately what "read_sql_query" would build a DataFrame from
    loop = results.fetchmany(50)
    if len(loop) == 0:
        break
    # convert each pyodbc Row to a tuple for numpy
    loop = [tuple(x) for x in loop]
    # convert to a numpy structured array with explicit column types
    rows = np.array(loop, dtype=[('BusinessEntityID', np.int32),
                                 ('NationalIDNumber', '<S15')])
    # convert to a DataFrame and append to the accumulated result
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    rows = pd.DataFrame.from_records(rows)
    data = pd.concat([data, rows], ignore_index=True)
print(data.dtypes)
BusinessEntityID int32
NationalIDNumber object
dtype: object
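Note that growing the DataFrame inside the loop copies the accumulated data on every batch. A variant sketch (my adaptation, not from the original answer), reusing the cursor and dtypes from above in place of the loop: collect the structured arrays in a list and build the DataFrame once at the end.

# Variant sketch: gather each batch as a structured array, then build the
# DataFrame in a single step instead of growing it inside the loop.
batches = []
while True:
    batch = results.fetchmany(50)
    if not batch:
        break
    batches.append(np.array([tuple(x) for x in batch],
                            dtype=[('BusinessEntityID', np.int32),
                                   ('NationalIDNumber', '<S15')]))
# assumes at least one batch was fetched
data = pd.DataFrame.from_records(np.concatenate(batches))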
See this link for more about strings in numpy:
What is the difference between the types <type 'numpy.string_'> and <type 'str'>?
Upvotes: 1