Reputation: 398
I am in the process of reducing the memory usage of my code. The goal of this code is to handle some big datasets, which are stored in Pandas DataFrames if that is relevant.
Among many other kinds of data there are some small integers. Because they contain missing values (NA), Python sets them to the float64 type by default. I was trying to downcast them to a smaller int format (int8 or int16 for example), but I got an error because of the NAs.
It seems that there is a new integer type (Int64) that can handle missing values, but it wouldn't help with memory usage. I gave some thought to using a category, but I am not sure it won't create a bottleneck further down the pipeline. Downcasting float64 to float32 seems to be my main option for reducing memory usage (rounding errors do not really matter for my usage).
Is there a better option to reduce the memory consumption of handling small integers with missing values?
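To illustrate, a minimal reproduction of the error (a toy Series standing in for one of my columns):

import numpy as np
import pandas as pd

# A small-integer column with a missing value is upcast to float64 by default
s = pd.Series([1, 2, np.nan, 4])
print(s.dtype)  # float64

# Downcasting to a plain NumPy integer dtype fails because of the NaN
try:
    s.astype('int8')
except ValueError as err:
    print(err)  # Cannot convert non-finite values (NA or inf) to integer

# Downcasting to float32 works and halves the memory of the values
print(s.astype('float32').memory_usage(index=False))  # 16 bytes for 4 values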
Upvotes: 0
Views: 608
Reputation: 54
The nullable "Integer Array" data types (available since Pandas v0.24) do allow significant memory savings. Missing values are recognized by Pandas' .isnull() and are also compatible with the PyArrow feather format, which is disk-efficient for writing data. Feather requires a consistent data type per column. See the Pandas documentation here. Here is an example; note the capital 'I' in the Pandas-specific Int16 data type.
import pandas as pd
import numpy as np

# Sample data with missing values in every column
dftemp = pd.DataFrame({'dt_col': ['1/1/2020', np.nan, '1/3/2020', '1/4/2020'],
                       'int_col': [4, np.nan, 3, 1],
                       'float_col': [0.0, 1.0, np.nan, 4.5],
                       'bool_col': [True, False, False, True],
                       'text_col': ['a', 'b', None, 'd']})

# Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

# Map each column to its target dtype; note 'Int16' for the integer column
lst_cols = ['int_col', 'float_col', 'bool_col', 'text_col']
lst_dtypes = ['Int16', 'float', 'bool', 'object']
dict_types = dict(zip(lst_cols, lst_dtypes))

# Unoptimized DataFrame
df = pd.read_csv('MixedTypes.csv')
df
Result:
dt_col int_col float_col bool_col text_col
0 1/1/2020 4.0 0.0 True a
1 NaN NaN 1.0 False b
2 1/3/2020 3.0 NaN False NaN
3 1/4/2020 1.0 4.5 True d
Check memory usage (with special focus on int_col):
df.memory_usage()
Result:
Index 128
dt_col 32
int_col 32
float_col 32
bool_col 4
text_col 32
dtype: int64
Repeat with explicit assignment of variable types, including Int16 for int_col:
df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types, parse_dates=['dt_col'])
print(df2)
dt_col int_col float_col bool_col text_col
0 2020-01-01 4 0.0 True a
1 NaT <NA> 1.0 False b
2 2020-01-03 3 NaN False NaN
3 2020-01-04 1 4.5 True d
df2.memory_usage()
With larger-scale data, this results in significant memory and disk space savings in my experience:
Index 128
dt_col 32
int_col 12
float_col 32
bool_col 4
text_col 32
dtype: int64
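If your data is already in memory as float64 (as in the question), you can apply the same nullable dtypes with astype rather than re-reading the CSV. A minimal sketch, assuming pyarrow is installed for the feather round-trip (the file name data.feather is arbitrary):

import pandas as pd
import numpy as np

df = pd.DataFrame({'int_col': [4, np.nan, 3, 1]})  # float64 because of the NaN

# Convert the existing float64 column to a nullable 8-bit integer
df['int_col'] = df['int_col'].astype('Int8')
print(df['int_col'].dtype)  # Int8

# Missing values are still recognized
print(df['int_col'].isnull().sum())  # 1

# Nullable dtypes survive a feather round-trip (requires pyarrow)
df.to_feather('data.feather')
print(pd.read_feather('data.feather')['int_col'].dtype)  # Int8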
Upvotes: 2