encore2097

Reputation: 503

Force int32 as dtype instead of int64 in pandas read_csv with dtype and converters

https://github.com/pandas-dev/pandas/pull/2708 says propagation of other types is working. However, I'm unable to load my hex-coded values as int32; they go into the DataFrame as int64.

data

2009-01-01T18:55:25Z,574,575,574,575,574,575,574,575,2,True
2009-01-01T18:56:55Z,574,575,574,575,573,574,573,574,2,True
2009-01-01T18:57:25Z,573,574,573,574,573,574,573,574,2,True
2009-01-01T18:57:30Z,573,574,573,574,573,574,573,574,2,True
2009-01-01T19:07:20Z,574,575,574,575,574,575,574,575,1,True
2009-01-01T19:07:55Z,574,575,574,575,574,575,574,575,1,True

names:

names = [
    'datetime',
    'sensorA',
    'sensorB',
    'sensorC',
     ...
    'signal',
]

conversion function:

def hex2int(x):
    return int(x, 16) * 100

converters:

convs = { i : hex2int for i in range(1,9) }

dtypes:

raw_dtypes = {
    'datetime': datetime.datetime,
    'sensorA': 'int32',
    'sensorB': 'int32',
    'sensorC': 'int32',
     ...
    'signal': 'int32',
}

read_csv:

df = pd.read_csv(filepath, delimiter=',', header=None, names=names, dtype=raw_dtypes, usecols=range(0, NUM_COLS-1), converters=convs, parse_dates=['datetime'])

Result:

>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308 entries, 0 to 1307
Data columns (total 10 columns):
datetime    1308 non-null datetime64[ns]
sensorA     1308 non-null int64
sensorB     1308 non-null int64
sensorC     1308 non-null int64
sensorD     1308 non-null int64
sensorE     1308 non-null int64
sensorF     1308 non-null int64
sensorG     1308 non-null int64
sensorH     1308 non-null int64
signal      1308 non-null int32
dtypes: datetime64[ns](1), int32(1), int64(8)

The last column ('signal') doesn't use a converter and gets the correct dtype, which is consistent with the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html (If converters are specified, they will be applied INSTEAD of dtype conversion.)

I'm pretty sure I'm not overflowing int32: my values range from 80000 to 160000. I've also tried casting the converter's return value, as in return np.int32(x, 16) * 100, but that did not change anything.

Upvotes: 2

Views: 2897

Answers (1)

chrisb

Reputation: 52236

As the documentation says, if both a converter and a dtype are specified for a column, only the converter will be applied. I think in version 0.20+ this generates a warning.

If a converter is applied, the data in that column takes a generic inference path, as if you had passed pd.Series([...converted data...]), which uses int64 as the default.
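To see that inference path in isolation, here is a small sketch. The three hex strings are just sample values in the style of the question's data, and the variable names are made up for illustration:

```python
import pandas as pd

# Python ints like the ones a converter such as hex2int would return.
values = [int(x, 16) * 100 for x in ["574", "575", "573"]]

# No dtype given: pandas infers one from the Python ints.
inferred = pd.Series(values)

# dtype given explicitly: the same values fit comfortably in int32.
explicit = pd.Series(values, dtype="int32")

print(inferred.dtype)
print(explicit.dtype)
```

The values themselves are well within int32 range; the wider dtype comes from the inference default, not from overflow.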

So for now, the best you can do is cast the dtype after the fact. Something like:

df = df.astype({'sensorA': 'int32', 'sensorB': 'int32'}) #etc
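A fuller sketch of that workaround, assuming a cut-down version of the question's file (two rows, two sensor columns, read from an in-memory string instead of filepath). Keying the converters by column name rather than position lets the same list of names drive the cast:

```python
import io
import pandas as pd

# Hypothetical sample in the same shape as the question's data,
# trimmed to two sensor columns for brevity.
SAMPLE = (
    "2009-01-01T18:55:25Z,574,575,2\n"
    "2009-01-01T18:56:55Z,573,574,2\n"
)

def hex2int(x):
    # Same converter as in the question: parse hex, scale by 100.
    return int(x, 16) * 100

names = ["datetime", "sensorA", "sensorB", "signal"]
sensor_cols = ["sensorA", "sensorB"]

df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    names=names,
    converters={col: hex2int for col in sensor_cols},
    dtype={"signal": "int32"},  # dtype only on the column without a converter
    parse_dates=["datetime"],
)

dtypes_before = [str(t) for t in df[sensor_cols].dtypes]

# Cast the converted columns after loading.
df = df.astype({col: "int32" for col in sensor_cols})

dtypes_after = [str(t) for t in df[sensor_cols].dtypes]
print(dtypes_before, dtypes_after)
```

Leaving the converted columns out of dtype also avoids asking read_csv for two conflicting things on the same column.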

Upvotes: 3
