jjei

Reputation: 1290

How does copying dataframes in different ways affect memory consumption?

I am trying to figure out how different ways of copying pandas DataFrames affect the memory consumption of a Python script, using the memory_profiler package:

from copy import deepcopy
import pandas as pd

@profile
def func2(df):
    return df.copy()


@profile
def func():
    a = [1] * (10 ** 6)
    df = pd.DataFrame({"val": a})
    df2 = pd.DataFrame({"val": a})

    df3 = func2(df)
    df4 = df
    df5 = df.copy(deep=False)

    df6 = df.copy(deep=True)
    df6.loc[:,"val"] = 2

    df7 = deepcopy(df)
    df7.loc[:,"val"] = 2


if __name__ == "__main__":
    func()

I run the script like this: python -m memory_profiler mem_test.py

And these are the results I get:

Filename: mem_test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4  141.629 MiB  141.629 MiB           1   @profile
     5                                         def func2(df):
     6  141.641 MiB    0.012 MiB           1       return df.copy()


Filename: mem_test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9   71.824 MiB   71.824 MiB           1   @profile
    10                                         def func():
    11   79.457 MiB    7.633 MiB           1       a = [1] * (10 ** 6)
    12  133.996 MiB   54.539 MiB           1       df = pd.DataFrame({"val": a})
    13  141.629 MiB    7.633 MiB           1       df2 = pd.DataFrame({"val": a})
    14
    15  141.641 MiB  141.641 MiB           1       df3 = func2(df)
    16  141.641 MiB    0.000 MiB           1       df4 = df
    17  141.641 MiB    0.000 MiB           1       df5 = df.copy(deep=False)
    18
    19  141.641 MiB    0.000 MiB           1       df6 = df.copy(deep=True)
    20  141.820 MiB    0.180 MiB           1       df6.loc[:,"val"] = 2
    21
    22  141.820 MiB    0.000 MiB           1       df7 = deepcopy(df)
    23  141.820 MiB    0.000 MiB           1       df7.loc[:,"val"] = 2

Things that confuse me:

What is the rationale behind the observations above? How do the different ways of copying a pandas DataFrame (plain assignment, shallow copy, deep copy and copy.deepcopy) differ in the memory consumption of a Python script? (A quick buffer-sharing check is included below for reference.)

If the chosen way of studying memory consumption is not valid for this use case, what method would you suggest instead?
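For reference, here is a quick check (a sketch, separate from the profiled script above) of which copies actually share the underlying numpy buffer; with the integer column used here, to_numpy() returns a view of the block, so np.shares_memory shows what was really copied:

import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [1] * (10 ** 6)})

df4 = df                   # plain assignment: same object, nothing copied
df5 = df.copy(deep=False)  # shallow copy: new DataFrame, shared data buffer
df6 = df.copy(deep=True)   # deep copy: new DataFrame, new data buffer

print(df4 is df)                                                       # True
print(np.shares_memory(df["val"].to_numpy(), df5["val"].to_numpy()))   # True
print(np.shares_memory(df["val"].to_numpy(), df6["val"].to_numpy()))   # False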

I am using Python 3.10.1 and the following versions of pandas and memory_profiler:

pandas==1.4.3
memory-profiler==0.60.0

EDIT:

I ran the code provided in the answer by @Ahmed AEK and the results were similar to what I got using the memory_profiler package. Since memory_profiler uses psutil under the hood, I began to wonder whether the issue could be related to the OS, since two approaches using psutil now both fail to give correct results.

Originally I was using macOS Catalina, but I have now run another test on Ubuntu. On Ubuntu I get similar results with both the original code and the code in the answer by @Ahmed AEK. So my best guess at the moment is that there is a bug in psutil that, for some reason, manifests with this code on macOS.
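For anyone wanting to cross-check psutil against a second source, the stdlib resource module reports peak RSS directly from the kernel. This is only a sketch (report is a hypothetical helper), and note that ru_maxrss is in bytes on macOS but kilobytes on Linux:

import os
import sys
import resource
import psutil

process = psutil.Process(os.getpid())

def report():
    # current RSS as seen by psutil, in MiB
    rss = process.memory_info().rss / (1024 ** 2)
    # peak RSS as reported by the kernel: bytes on macOS, kilobytes on Linux
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mib = peak / (1024 ** 2) if sys.platform == "darwin" else peak / 1024
    print(f"psutil rss = {rss:.2f} MiB, getrusage peak = {peak_mib:.2f} MiB")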

Upvotes: 1

Views: 289

Answers (1)

Ahmed AEK

Reputation: 17750

I can tell just by inspection that the numbers you got are nowhere near correct, and you might want to submit an issue to the profiler module's maintainers. In any case, you can request the program's memory consumption from the operating system using psutil, and if you just want to do some simple benchmarking (non-production code), you can write a simple wrapper that prints the memory consumption at a given line.

A disadvantage of this approach is that you only get the memory before and after the execution of a line, so it will not catch momentary memory spikes inside functions (although none should occur if the library authors are careful not to create any, as is the case for pandas).

import os, psutil
import inspect

process = psutil.Process(os.getpid())
last_val = 0

def print_mem():
    # resident set size of this process, in MiB
    mem = process.memory_info().rss / (1024 ** 2)
    global last_val
    increment = mem - last_val
    last_val = mem

    # dark magic to get the caller's line number and source line in CPython
    curframe = inspect.currentframe()
    calframe = inspect.getouterframes(curframe, 2)  # keep 2 lines of source context
    print('line_no =', calframe[1][2],'\t',end='')
    print(f'mem = {mem}\t\t',end='')
    print(f",increment = {increment}\t\t",end='')
    # with 2 context lines, [-2] is the line just above this print_mem() call,
    # i.e. the statement whose memory effect was just measured
    print('last_line = ',calframe[1][4][-2].strip())

Integrating it into your benchmark is simple:

# same imports as the original script; print_mem is defined above
from copy import deepcopy
import pandas as pd


def func2(df):
    print_mem()
    b = df.copy()
    print_mem()
    return b


def func():
    print_mem()
    a = [1] * (10 ** 6)
    print_mem()
    df = pd.DataFrame({"val": a})
    print_mem()
    df2 = pd.DataFrame({"val": a})
    print_mem()
    df3 = func2(df)
    df4 = df
    df5 = df.copy(deep=False)
    print_mem()
    df6 = df.copy(deep=True)
    print_mem()
    df6.loc[:,"val"] = 2
    print_mem()
    df7 = deepcopy(df)
    print_mem()
    df7.loc[:,"val"] = 2

and the results make sense:

line_no = 33    mem = 52.5078125        ,increment = 52.5078125     last_line =  def func():
line_no = 35    mem = 60.453125     ,increment = 7.9453125      last_line =  a = [1] * (10 ** 6)
line_no = 37    mem = 68.16796875       ,increment = 7.71484375     last_line =  df = pd.DataFrame({"val": a})
line_no = 39    mem = 75.80078125       ,increment = 7.6328125      last_line =  df2 = pd.DataFrame({"val": a})
line_no = 26    mem = 75.80078125       ,increment = 0.0        last_line =  def func2(df):
line_no = 28    mem = 83.43359375       ,increment = 7.6328125      last_line =  b = df.copy()
line_no = 43    mem = 83.43359375       ,increment = 0.0        last_line =  df5 = df.copy(deep=False)
line_no = 45    mem = 91.06640625       ,increment = 7.6328125      last_line =  df6 = df.copy(deep=True)
line_no = 47    mem = 91.25390625       ,increment = 0.1875     last_line =  df6.loc[:,"val"] = 2
line_no = 49    mem = 98.890625     ,increment = 7.63671875     last_line =  df7 = deepcopy(df)

A minor remark: executing the same function twice can show different memory consumption, because a library might load some extra resources the first time you execute one of its functions (the library is probably lazy-loading parts of itself).
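A minimal sketch of that effect (rss_mib and build are hypothetical helpers, and the exact numbers will vary by platform): calling the same constructor twice and keeping both results should show roughly the same data increment, plus any one-time lazy imports on the first call.

import os, psutil
import pandas as pd

process = psutil.Process(os.getpid())

def rss_mib():
    # current resident set size, in MiB
    return process.memory_info().rss / (1024 ** 2)

def build():
    return pd.DataFrame({"val": [1] * (10 ** 6)})

before = rss_mib()
df_a = build()
print(f"first call:  +{rss_mib() - before:.2f} MiB")  # data + any one-time lazy imports

before = rss_mib()
df_b = build()
print(f"second call: +{rss_mib() - before:.2f} MiB")  # data only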

Alternatively, commercial profilers may be able to catch momentary spikes, but I am only describing a copy-paste workaround here, as monitoring those spikes requires another process to be watching your Python process.
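For completeness, here is a rough sketch of such a monitor using a background thread instead of a full separate process (peak_memory is a hypothetical helper; a thread can still miss spikes while the GIL is held, which is why a dedicated monitoring process is more reliable):

import os
import threading
import time
import psutil

def peak_memory(func, interval=0.001):
    # run func() while a background thread samples this process's RSS;
    # returns (result, peak RSS in MiB observed during the call)
    proc = psutil.Process(os.getpid())
    peak = [proc.memory_info().rss]
    done = threading.Event()

    def sampler():
        while not done.is_set():
            peak[0] = max(peak[0], proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        result = func()
    finally:
        done.set()
        t.join()
    return result, peak[0] / (1024 ** 2)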

Upvotes: 2
