Reputation: 983
I would like to reduce the memory consumption when managing some pandas DataFrames. I am aware of the trick of switching from float64 to float32, for instance, and that is already interesting.
To go further, and knowing that my numeric values actually have 'small' absolute min and max values, I was wondering whether it is possible to ask pandas to use a scale factor for a given column.
The best example can be with percent.
With percent, you know the min is 0 and the max is 1. These min and max values could be stored as attributes of the column.
I could then use int8, for instance, and column values would be stored as scaled integers in [-128, 127]. When used, they would be scaled back to their "original value" (with some rounding error) using the min and max stored as column attributes.
Is this kind of approach possible for management of pandas DataFrames?
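To make the idea concrete, here is a minimal sketch of that scaling scheme, assuming the per-column min and max are known; the `quantize`/`dequantize` helper names are hypothetical, and the bounds are stashed in `Series.attrs`:

```python
import numpy as np
import pandas as pd

def quantize(series, vmin=None, vmax=None):
    """Scale a float column into int8 [-128, 127], keeping min/max as attrs."""
    vmin = series.min() if vmin is None else vmin
    vmax = series.max() if vmax is None else vmax
    scaled = (series - vmin) / (vmax - vmin)               # map to [0, 1]
    codes = np.round(scaled * 255 - 128).astype(np.int8)   # map to [-128, 127]
    out = pd.Series(codes, index=series.index, name=series.name)
    out.attrs["vmin"], out.attrs["vmax"] = vmin, vmax
    return out

def dequantize(codes):
    """Recover approximate original values (lossy, ~1/255 resolution)."""
    vmin, vmax = codes.attrs["vmin"], codes.attrs["vmax"]
    return (codes.astype(np.float64) + 128) / 255 * (vmax - vmin) + vmin
```

This trades precision for space (8 bits per value instead of 64), so it only makes sense when ~1/255 resolution over the column's range is acceptable.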
Thanks for your help and feedback! Best,
Upvotes: 0
Views: 192
Reputation: 773
I use this helper function; of course, better solutions may exist.
import numpy


def reduce_mem_usage(data_frame):
    start_mem_usg = data_frame.memory_usage(deep=True).sum() / 1024 ** 2
    print("Memory usage of the dataframe is: {:03.2f} MB".format(start_mem_usg))
    for col in data_frame.columns:
        # Exclude strings and datetimes
        if data_frame[col].dtype in [object, "datetime64", "datetime64[ns]"]:
            continue
        print("******************************")
        print("Column:", col)
        print("dtype before:", data_frame[col].dtype)
        try:
            mx = data_frame[col].max()
            mn = data_frame[col].min()
        except (TypeError, ValueError):
            continue
        # Treat the column as integer-valued if casting to int64 loses
        # (essentially) nothing
        is_int = False
        try:
            as_int = data_frame[col].fillna(0).astype(numpy.int64)
            delta = (data_frame[col].fillna(0) - as_int).abs().sum()
            is_int = delta < 0.01
        except (TypeError, ValueError, OverflowError):
            continue
        try:
            if is_int:
                if mn >= 0:
                    # Unsigned integer types
                    if mx <= numpy.iinfo(numpy.uint8).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint8)
                    elif mx <= numpy.iinfo(numpy.uint16).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint16)
                    elif mx <= numpy.iinfo(numpy.uint32).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint32)
                    else:
                        data_frame[col] = data_frame[col].astype(numpy.uint64)
                else:
                    # Signed integer types
                    if mn >= numpy.iinfo(numpy.int8).min and mx <= numpy.iinfo(numpy.int8).max:
                        data_frame[col] = data_frame[col].astype(numpy.int8)
                    elif mn >= numpy.iinfo(numpy.int16).min and mx <= numpy.iinfo(numpy.int16).max:
                        data_frame[col] = data_frame[col].astype(numpy.int16)
                    elif mn >= numpy.iinfo(numpy.int32).min and mx <= numpy.iinfo(numpy.int32).max:
                        data_frame[col] = data_frame[col].astype(numpy.int32)
                    else:
                        data_frame[col] = data_frame[col].astype(numpy.int64)
            else:
                # Make float datatypes 32 bit
                data_frame[col] = data_frame[col].astype(numpy.float32)
            print("dtype after:", data_frame[col].dtype)
            print("******************************")
        except ValueError:
            continue
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = data_frame.memory_usage(deep=True).sum() / 1024 ** 2
    print("Memory usage is: {:03.2f} MB".format(mem_usg))
    print("This is {:.1f}% of the initial size".format(100 * mem_usg / start_mem_usg))
    return data_frame
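For comparison, pandas' built-in `pd.to_numeric` with its `downcast` parameter performs a similar per-column downcast without a custom helper; a small sketch with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64),  # values 0..999 fit in int16
    "ratio": np.linspace(0.0, 1.0, 1000),          # float64, fits float32
})
before = df.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest signed integer dtype that fits,
# downcast="float" downcasts float64 to float32
df["small_int"] = pd.to_numeric(df["small_int"], downcast="integer")
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")
after = df.memory_usage(deep=True).sum()
```

Note the float downcast still costs precision (float32 has ~7 significant digits), just as in the helper above.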
Upvotes: 2