Petter T

Reputation: 3627

Pandas read_sql_query converts 32-bit data to 64-bit

I am using pandas.read_sql_query to read some data from SQL Server. The column types I read are int and real in SQL Server, i.e. 32-bit integers and 32-bit floating-point values, but in the resulting DataFrame the dtypes are int64 and float64. I could convert back to 32-bit after getting the data into the DataFrame, but I am reading a large amount of data, so the temporary 64-bit copy would cause memory problems. Is there a way to keep the data in 32-bit while reading it from SQL Server into the DataFrame?

Here is an example of the db connection:

import pandas as pd
import pyodbc

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)

query = ("SELECT "
         "MId, "
         "SomeInt, "
         "SomeReal "
         "FROM dbo.Example;")

df = pd.read_sql_query(query, cnxn)
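
Converting back afterwards would look like the snippet below, but the full 64-bit frame has to be materialized in memory first, which is exactly what I want to avoid:

# downcast after loading -- works, but the whole 64-bit frame
# already exists in memory at this point
df = df.astype({'SomeInt': 'int32', 'SomeReal': 'float32'})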

Upvotes: 3

Views: 497

Answers (1)

Jason Cook

Reputation: 1511

Looking at the current pandas code, I believe the only option is to develop your own solution with a loop. Both "read_sql_query" and "read_sql" ultimately call "pandas.DataFrame.from_records" on a list of pyodbc rows, and "DataFrame.from_records" doesn't accept per-column data types.

"pandas.DataFrame" likely won't help as it only accepts specifying a single data type. Instead you could use numpy to help build the DataFrame in a loop approach.

import pandas as pd
import numpy as np
import pyodbc

server_name = 'localhost'
database_name = 'AdventureWorks2019'
driver = 'ODBC Driver 17 for SQL Server'

connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')

cursor = connection.cursor()


# sample MSSQL AdventureWorks database
results = cursor.execute('SELECT BusinessEntityID, NationalIDNumber FROM HumanResources.Employee')

# preallocate an empty DataFrame with the known data types
data = pd.DataFrame(columns=['BusinessEntityID', 'NationalIDNumber'])
data = data.astype({'BusinessEntityID': 'int32', 'NationalIDNumber': 'object'})


while True:

    # load results into memory in batches until there are none left;
    # the full row list is what "read_sql_query" would build a DataFrame from
    batch = results.fetchmany(50)
    if len(batch) == 0:
        break

    # pyodbc rows must be converted to tuples for numpy
    batch = [tuple(x) for x in batch]

    # build a numpy structured array with explicit per-column types
    rows = np.array(batch, dtype=[('BusinessEntityID', np.int32), ('NationalIDNumber', '<S15')])

    # convert the batch to a DataFrame and append it
    # (DataFrame.append was removed in pandas 2.0; pd.concat does the same)
    rows = pd.DataFrame.from_records(rows)
    data = pd.concat([data, rows], ignore_index=True)

print(data.dtypes)

BusinessEntityID     int32
NationalIDNumber    object
dtype: object
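
If holding one 64-bit chunk at a time in memory is acceptable, the chunksize argument of "read_sql_query" gives you a similar loop with less code: each chunk is downcast as it arrives, so the 64-bit copy never exceeds one chunk. A sketch using the column names from your query (the chunk size is arbitrary):

chunks = []
for chunk in pd.read_sql_query(query, cnxn, chunksize=50000):
    # only the current chunk exists as 64-bit before the downcast
    chunks.append(chunk.astype({'SomeInt': 'int32', 'SomeReal': 'float32'}))
df = pd.concat(chunks, ignore_index=True)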

See this link for more about strings in numpy:

What is the difference between the types <type 'numpy.string_'> and <type 'str'>?
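
As a quick illustration of that difference: the '<S15' dtype above stores fixed-width bytes, so the values land in the DataFrame as bytes objects rather than str. A minimal standalone example (the value is made up):

import numpy as np

a = np.array(['295847284'], dtype='<S15')
print(type(a[0]))     # <class 'numpy.bytes_'>
print(a[0].decode())  # back to a Python str: '295847284'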

Upvotes: 1
