txg

Reputation: 83

Pandas Series with dtype=int defaulting to int32 instead of int64 on 64-bit Python environment

I'm working on a Windows system with a 64-bit version of Python (Python 3.10.13, packaged by Anaconda, Inc.). When I run Python, the header indicates that it's a 64-bit environment: "Python 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)] on win32". I am not sure why it shows win32, since I have a 64-bit Windows machine.

I've verified the bitness in several ways. First, the following code confirms that I'm running a 64-bit Python environment:

import platform
import sys

assert platform.architecture()[0] == "64bit"
assert sys.maxsize > 2**32

Next, I checked my conda version with conda info, which gives platform : win-64.

However, when I use Pandas (version 2.1.3) and create a Series with dtype=int:

import pandas as pd

print(pd.Series([1,2,3], dtype=int).dtype)

It shows 'int32' instead of 'int64'. I expected it to default to int64 in a 64-bit environment. If I do not specify a dtype, as in print(pd.Series([1,2,3]).dtype), it prints 'int64'.

Why is Pandas defaulting to int32 instead of int64 in my 64-bit Python environment, and how can I ensure that it defaults to int64?

I do not want to explicitly convert all my DataFrames with .astype("int64"), since that could result in failing tests on other machines.

Upvotes: 1

Views: 892

Answers (1)

Phenyl

Reputation: 583

Restating the problem

Your tests fail on some machines because the data under test and the expected data have equivalent but not identical dtypes, such as int32 vs. int64. Something like assert_frame_equal(df1, df2, check_dtype='equiv') would be handy, but it does not work because pandas uses the strict check of assert_attr_equal under the hood.
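To see where the width difference comes from, here is a minimal sketch. It assumes that pandas resolves dtype=int through NumPy's platform default integer, which is 32-bit on Windows with NumPy versions before 2.0 and 64-bit on most other setups. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# The builtin `int` maps to NumPy's platform default integer.
# Assumption: this is int32 on Windows with NumPy < 2.0, int64 elsewhere.
platform_int = np.dtype(int)
print("platform default integer:", platform_int)

# dtype=int follows the platform default, so its width can differ by machine.
builtin = pd.Series([1, 2, 3], dtype=int)

# A string dtype with an explicit width is the same on every platform.
explicit = pd.Series([1, 2, 3], dtype="int64")

print("dtype=int     ->", builtin.dtype)
print('dtype="int64" ->', explicit.dtype)
```

Spelling out the width when constructing the data is one way to keep dtypes stable across machines; the workaround below covers the case where that is not practical.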

You don't want to use assert_frame_equal(df1, df2, check_dtype=False) because it skips the dtype check entirely, so a genuine type mismatch (for example, integers vs. floats) would go unnoticed.
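A small sketch of why check_dtype=False is too permissive. The helper frames_pass is a hypothetical name introduced here only to turn the assertion into a boolean; it is not part of pandas.

```python
import pandas as pd

ints32 = pd.DataFrame({"x": [1, 2, 3]}, dtype="int32")
ints64 = pd.DataFrame({"x": [1, 2, 3]}, dtype="int64")
floats = pd.DataFrame({"x": [1.0, 2.0, 3.0]})


def frames_pass(left, right, **kwargs):
    """Hypothetical helper: True if assert_frame_equal does not raise."""
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
        return True
    except AssertionError:
        return False


# Default (strict) check: int32 vs. int64 is rejected.
strict_width = frames_pass(ints32, ints64)
# check_dtype=False accepts the harmless width difference...
loose_width = frames_pass(ints32, ints64, check_dtype=False)
# ...but it also accepts integers vs. floats, which a test usually should catch.
loose_kind = frames_pass(ints64, floats, check_dtype=False)

print(strict_width, loose_width, loose_kind)
```

The goal of the workaround is the middle ground: accept a different width within the same kind of dtype, but still reject a different kind.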

Workaround

My workaround is to cast columns with equivalent types into the same one in my tests.

Example

import pandas as pd


a = pd.DataFrame({'Int': [1, 2, 3], 'Float': [0.57, 0.179, 0.213]})  # Automatic type casting
# Force 32-bit
b = a.copy()
b['Int'] = b['Int'].astype('int32')
b['Float'] = b['Float'].astype('float32')
# Force 64-bit
c = a.copy()
c['Int'] = c['Int'].astype('int64')
c['Float'] = c['Float'].astype('float64')
try:
    pd.testing.assert_frame_equal(b, c)
    print('Success')
except AssertionError as err:
    print(err)

gives:

Attributes of DataFrame.iloc[:, 0] (column name="Int") are different

Attribute "dtype" are different
[left]:  int32
[right]: int64

Workaround function:

def assert_frame_equiv(left: pd.DataFrame, right: pd.DataFrame) -> None:
    """Cast equivalent dtypes (int vs. int, float vs. float) to the same dtype before comparing."""
    # First, check that both frames have the same columns (in any order).
    pd.testing.assert_index_equal(left.columns, right.columns, check_order=False)
    # Work on a copy so the caller's DataFrame is not modified.
    left = left.copy()
    # Knowing the column names match, cast left to right's dtype where the dtypes are equivalent.
    for col_name in left.columns:
        lcol = left[col_name]
        rcol = right[col_name]
        if (
            (pd.api.types.is_integer_dtype(lcol) and pd.api.types.is_integer_dtype(rcol))
            or (pd.api.types.is_float_dtype(lcol) and pd.api.types.is_float_dtype(rcol))
        ):
            left[col_name] = lcol.astype(rcol.dtype)

    # check_like=True ignores the order of index and columns.
    pd.testing.assert_frame_equal(left, right, check_like=True)


try:
    assert_frame_equiv(b, c)
    print('Success')
except AssertionError as err:
    print(err)

Which gives:

Success

EDIT

I opened a feature request to add check_dtype='equiv' to pandas.

Upvotes: 0
