pierre_j

Reputation: 983

Reducing pandas DataFrame memory consumption by use of scaled data for each column?

I would like to reduce the memory consumption when managing some pandas DataFrames. I am aware of the trick of switching from float64 to float32, for instance, and that is interesting.

To go even further, and knowing that my numeric values actually have 'small' absolute min and max values, I was wondering if it is possible to ask pandas to use a scale factor for a given column.

The best example can be with percent.

With percent, you know the min is 0 and the max is 1. These min and max values could be stored as attributes of the column.

I could then use int8, for instance, and column values would be stored as scaled values in [-128, 127]. When used, they would be scaled back to their "original value" (with some rounding) using the min and max values that have been stored as column attributes.

Is this kind of approach possible for management of pandas DataFrames?
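For illustration, the scheme described above can be sketched by hand with NumPy. The helper names below are hypothetical, and the sketch assumes the column's max is strictly greater than its min:

```python
import numpy as np
import pandas as pd

def encode_scaled(series, dtype=np.uint8):
    """Quantize a float column into integer codes plus (min, max) metadata."""
    mn, mx = series.min(), series.max()          # assumes mx > mn
    levels = np.iinfo(dtype).max
    codes = np.round((series - mn) / (mx - mn) * levels).astype(dtype)
    return pd.Series(codes, index=series.index), (mn, mx)

def decode_scaled(codes, bounds, dtype=np.uint8):
    """Recover approximate original values from the stored codes."""
    mn, mx = bounds
    levels = np.iinfo(dtype).max
    return codes.astype(np.float64) / levels * (mx - mn) + mn

pct = pd.Series(np.random.default_rng(0).random(100_000))  # values in [0, 1)
codes, bounds = encode_scaled(pct)    # 1 byte per value instead of 8
approx = decode_scaled(codes, bounds)
```

With uint8 codes the worst-case rounding error is (max - min) / (2 * 255), so for a percentage column it stays below 0.2%. Pandas itself has no built-in per-column scale-factor storage, so the (min, max) metadata has to be tracked alongside the DataFrame.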

Thanks for your help and feedback! Best,

Upvotes: 0

Views: 192

Answers (1)

mustafasencer

Reputation: 773

I use this helper function, which downcasts each numeric column to the smallest dtype that can hold its values. Better solutions may of course exist.

import numpy
import pandas


def reduce_mem_usage(data_frame):
    start_mem_usg = data_frame.memory_usage(deep=True).sum() / 1024 ** 2
    print("Memory usage of the dataframe is: {:03.2f} MB".format(start_mem_usg))
    for col in data_frame.columns:
        # Skip strings, datetimes, etc.; only numeric columns are downcast
        if not pandas.api.types.is_numeric_dtype(data_frame[col]):
            continue

        print("******************************")
        print("Column: ", col)
        print("dtype before: ", data_frame[col].dtype)
        mn = data_frame[col].min()
        mx = data_frame[col].max()

        # Treat the column as integral if dropping the fractional part
        # changes (almost) nothing
        is_int = False
        try:
            as_int = data_frame[col].fillna(0).astype(numpy.int64)
            delta = (data_frame[col].fillna(0) - as_int).sum()
            if -0.01 < delta < 0.01:
                is_int = True
        except (ValueError, OverflowError):
            continue

        try:
            if is_int:
                if mn >= 0:
                    # Unsigned: pick the smallest type whose range holds mx
                    if mx <= numpy.iinfo(numpy.uint8).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint8)
                    elif mx <= numpy.iinfo(numpy.uint16).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint16)
                    elif mx <= numpy.iinfo(numpy.uint32).max:
                        data_frame[col] = data_frame[col].astype(numpy.uint32)
                    else:
                        data_frame[col] = data_frame[col].astype(numpy.uint64)
                else:
                    # Signed: pick the smallest type whose range holds [mn, mx]
                    if mn >= numpy.iinfo(numpy.int8).min and mx <= numpy.iinfo(numpy.int8).max:
                        data_frame[col] = data_frame[col].astype(numpy.int8)
                    elif mn >= numpy.iinfo(numpy.int16).min and mx <= numpy.iinfo(numpy.int16).max:
                        data_frame[col] = data_frame[col].astype(numpy.int16)
                    elif mn >= numpy.iinfo(numpy.int32).min and mx <= numpy.iinfo(numpy.int32).max:
                        data_frame[col] = data_frame[col].astype(numpy.int32)
                    else:
                        data_frame[col] = data_frame[col].astype(numpy.int64)
            else:
                # Make float datatypes 32 bit
                data_frame[col] = data_frame[col].astype(numpy.float32)
            print("dtype after: ", data_frame[col].dtype)
            print("******************************")
        except ValueError:
            continue
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = data_frame.memory_usage(deep=True).sum() / 1024 ** 2
    print("Memory usage is: {:03.2f} MB".format(mem_usg))
    print("This is {:.1f}% of the initial size".format(100 * mem_usg / start_mem_usg))
    return data_frame
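As a lighter-weight alternative, pandas ships a built-in downcasting path via `pd.to_numeric` with the `downcast` argument. Note it only picks the smallest dtype that holds the values losslessly; it does not do any scaling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_ints": np.arange(100, dtype=np.int64),  # values 0..99 fit in uint8
    "ratios": np.linspace(0.0, 1.0, 100),          # float64 by default
})

# Downcast each column to the smallest dtype that can represent it
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="unsigned")
df["ratios"] = pd.to_numeric(df["ratios"], downcast="float")
```

After this, `small_ints` is stored as uint8 and `ratios` as float32 (float32 is the smallest float dtype `to_numeric` will produce).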

Upvotes: 2
