Reputation: 398
I am in the process of reducing the memory usage of my code. The goal of this code is to handle some big datasets, which are stored in Pandas DataFrames if that is relevant.
Among many other kinds of data there are some small integers. Because they contain missing values (NA), Python sets them to the float64 type by default. I was trying to downcast them to a smaller int format (int8 or int16 for example), but I got an error because of the NAs.
It seems that there is a new integer type (Int64) that can handle missing values, but it wouldn't help with memory usage. I gave some thought to using a category, but I am not sure it won't create a bottleneck further down the pipeline. Downcasting float64 to float32 seems to be my main option for reducing memory usage (rounding errors do not really matter for my usage).
Is there a better option to reduce the memory consumption of handling small integers with missing values?
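To illustrate, a minimal reproduction of the error (a toy Series standing in for one of my columns):

import numpy as np
import pandas as pd

# A small-integer column with a missing value is upcast to float64 by default
s = pd.Series([1, 2, np.nan, 4])
print(s.dtype)  # float64

# Downcasting to a plain NumPy integer dtype fails because of the NaN
try:
    s.astype('int8')
except ValueError as err:
    print(err)  # Cannot convert non-finite values (NA or inf) to integer

# Downcasting to float32 works and halves the memory of the values
print(s.astype('float32').memory_usage(index=False))  # 16 bytes for 4 values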
Upvotes: 0
Views: 608
Reputation: 54
The nullable "Integer Array" data types (available since Pandas v0.24) do allow significant memory savings. Missing values are recognized by Pandas' .isnull() and are also compatible with the PyArrow feather format, which is disk-efficient for writing data. Feather requires a consistent data type per column. See the Pandas documentation here. Here is an example; note the capital 'I' in the Pandas-specific Int16 data type.
import pandas as pd
import numpy as np

# Sample data with missing values in every column
dftemp = pd.DataFrame({'dt_col': ['1/1/2020', np.nan, '1/3/2020', '1/4/2020'],
                       'int_col': [4, np.nan, 3, 1],
                       'float_col': [0.0, 1.0, np.nan, 4.5],
                       'bool_col': [True, False, False, True],
                       'text_col': ['a', 'b', None, 'd']})

# Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

# Map each column to its target dtype; note 'Int16' for the integer column
lst_cols = ['int_col', 'float_col', 'bool_col', 'text_col']
lst_dtypes = ['Int16', 'float', 'bool', 'object']
dict_types = dict(zip(lst_cols, lst_dtypes))

# Unoptimized DataFrame
df = pd.read_csv('MixedTypes.csv')
df
Result:
dt_col int_col float_col bool_col text_col
0 1/1/2020 4.0 0.0 True a
1 NaN NaN 1.0 False b
2 1/3/2020 3.0 NaN False NaN
3 1/4/2020 1.0 4.5 True d
Check memory usage (with special focus on int_col):
df.memory_usage()
Result:
Index 128
dt_col 32
int_col 32
float_col 32
bool_col 4
text_col 32
dtype: int64
Repeat with explicit assignment of variable types, including Int16 for int_col:
df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types, parse_dates=['dt_col'])
print(df2)
dt_col int_col float_col bool_col text_col
0 2020-01-01 4 0.0 True a
1 NaT <NA> 1.0 False b
2 2020-01-03 3 NaN False NaN
3 2020-01-04 1 4.5 True d
df2.memory_usage()
With larger-scale data, this results in significant memory and disk space savings in my experience:
Index 128
dt_col 32
int_col 12
float_col 32
bool_col 4
text_col 32
dtype: int64
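If your data is already in memory as float64 (as in the question), you can apply the same nullable dtypes with astype rather than re-reading the CSV. A minimal sketch, assuming pyarrow is installed for the feather round-trip (the file name data.feather is arbitrary):

import pandas as pd
import numpy as np

df = pd.DataFrame({'int_col': [4, np.nan, 3, 1]})  # float64 because of the NaN

# Convert the existing float64 column to a nullable 8-bit integer
df['int_col'] = df['int_col'].astype('Int8')
print(df['int_col'].dtype)  # Int8

# Missing values are still recognized
print(df['int_col'].isnull().sum())  # 1

# Nullable dtypes survive a feather round-trip (requires pyarrow)
df.to_feather('data.feather')
print(pd.read_feather('data.feather')['int_col'].dtype)  # Int8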
Upvotes: 2