jjei

Reputation: 1290

How does copying dataframes in different ways affect memory consumption?

I am trying to figure out how different ways of copying pandas DataFrames affect the memory consumption of a Python script, using the memory_profiler package:

from copy import deepcopy
import pandas as pd

@profile
def func2(df):
    return df.copy()


@profile
def func():
    a = [1] * (10 ** 6)
    df = pd.DataFrame({"val": a})
    df2 = pd.DataFrame({"val": a})

    df3 = func2(df)
    df4 = df
    df5 = df.copy(deep=False)

    df6 = df.copy(deep=True)
    df6.loc[:,"val"] = 2

    df7 = deepcopy(df)
    df7.loc[:,"val"] = 2


if __name__ == "__main__":
    func()

I run the script like this: python -m memory_profiler mem_test.py

And these are the results I get:

Filename: mem_test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4  141.629 MiB  141.629 MiB           1   @profile
     5                                         def func2(df):
     6  141.641 MiB    0.012 MiB           1       return df.copy()


Filename: mem_test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9   71.824 MiB   71.824 MiB           1   @profile
    10                                         def func():
    11   79.457 MiB    7.633 MiB           1       a = [1] * (10 ** 6)
    12  133.996 MiB   54.539 MiB           1       df = pd.DataFrame({"val": a})
    13  141.629 MiB    7.633 MiB           1       df2 = pd.DataFrame({"val": a})
    14
    15  141.641 MiB  141.641 MiB           1       df3 = func2(df)
    16  141.641 MiB    0.000 MiB           1       df4 = df
    17  141.641 MiB    0.000 MiB           1       df5 = df.copy(deep=False)
    18
    19  141.641 MiB    0.000 MiB           1       df6 = df.copy(deep=True)
    20  141.820 MiB    0.180 MiB           1       df6.loc[:,"val"] = 2
    21
    22  141.820 MiB    0.000 MiB           1       df7 = deepcopy(df)
    23  141.820 MiB    0.000 MiB           1       df7.loc[:,"val"] = 2

Things that confuse me:

What is the rationale behind the observations above? How do the different ways of copying a pandas DataFrame (plain assignment, shallow copy, deep copy and copy.deepcopy) differ in the memory consumption of a Python script? (A quick buffer-sharing check is included below for reference.)

If the chosen way of studying memory consumption is not valid for this use case, what method would you suggest instead?
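For reference, here is a quick check (a sketch, separate from the profiled script above) of which copies actually share the underlying numpy buffer; with the integer column used here, to_numpy() returns a view of the block, so np.shares_memory shows what was really copied:

import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [1] * (10 ** 6)})

df4 = df                   # plain assignment: same object, nothing copied
df5 = df.copy(deep=False)  # shallow copy: new DataFrame, shared data buffer
df6 = df.copy(deep=True)   # deep copy: new DataFrame, new data buffer

print(df4 is df)                                                       # True
print(np.shares_memory(df["val"].to_numpy(), df5["val"].to_numpy()))   # True
print(np.shares_memory(df["val"].to_numpy(), df6["val"].to_numpy()))   # False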

I am using Python 3.10.1 and the following versions of pandas and memory_profiler:

pandas==1.4.3
memory-profiler==0.60.0

EDIT:

I ran the code provided in the answer by @Ahmed AEK and the results were similar to what I got using the memory_profiler package. Since memory_profiler uses psutil under the hood, I began to wonder whether the issue could be related to the OS, since two approaches using psutil now both fail to give correct results.

Originally I was using macOS Catalina, but I have now run another test on Ubuntu. On Ubuntu I get similar results with both the original code and the code in the answer by @Ahmed AEK. So my best guess at the moment is that there is a bug in psutil that, for some reason, manifests with this code on macOS.
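For anyone wanting to cross-check psutil against a second source, the stdlib resource module reports peak RSS directly from the kernel. This is only a sketch (report is a hypothetical helper), and note that ru_maxrss is in bytes on macOS but kilobytes on Linux:

import os
import sys
import resource
import psutil

process = psutil.Process(os.getpid())

def report():
    # current RSS as seen by psutil, in MiB
    rss = process.memory_info().rss / (1024 ** 2)
    # peak RSS as reported by the kernel: bytes on macOS, kilobytes on Linux
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_mib = peak / (1024 ** 2) if sys.platform == "darwin" else peak / 1024
    print(f"psutil rss = {rss:.2f} MiB, getrusage peak = {peak_mib:.2f} MiB")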

Upvotes: 1

Views: 289

Answers (1)

Ahmed AEK

Reputation: 17750

I can tell just by inspection that the numbers you got are nowhere near correct, and you might want to submit an issue to the profiler module's maintainers. In any case, you can request the program's memory consumption from the operating system using psutil, and if you just want to do some simple benchmarking (non-production code), you can write a simple wrapper that prints the memory consumption at a given line.

A disadvantage of this approach is that you only get the memory before and after the execution of a line, so it will not catch momentary memory spikes inside functions (although none should occur if the library authors are careful not to create any, as is the case for pandas).

import os, psutil
import inspect

process = psutil.Process(os.getpid())
last_val = 0

def print_mem():
    # resident set size of this process, in MiB
    mem = process.memory_info().rss / (1024 ** 2)
    global last_val
    increment = mem - last_val
    last_val = mem

    # dark magic to get the caller's line number and source line in CPython
    curframe = inspect.currentframe()
    calframe = inspect.getouterframes(curframe, 2)  # keep 2 lines of source context
    print('line_no =', calframe[1][2],'\t',end='')
    print(f'mem = {mem}\t\t',end='')
    print(f",increment = {increment}\t\t",end='')
    # with 2 context lines, [-2] is the line just above this print_mem() call,
    # i.e. the statement whose memory effect was just measured
    print('last_line = ',calframe[1][4][-2].strip())

Integrating it into your benchmark is simple:

# same imports as the original script; print_mem is defined above
from copy import deepcopy
import pandas as pd


def func2(df):
    print_mem()
    b = df.copy()
    print_mem()
    return b


def func():
    print_mem()
    a = [1] * (10 ** 6)
    print_mem()
    df = pd.DataFrame({"val": a})
    print_mem()
    df2 = pd.DataFrame({"val": a})
    print_mem()
    df3 = func2(df)
    df4 = df
    df5 = df.copy(deep=False)
    print_mem()
    df6 = df.copy(deep=True)
    print_mem()
    df6.loc[:,"val"] = 2
    print_mem()
    df7 = deepcopy(df)
    print_mem()
    df7.loc[:,"val"] = 2

and the results make sense:

line_no = 33    mem = 52.5078125        ,increment = 52.5078125     last_line =  def func():
line_no = 35    mem = 60.453125     ,increment = 7.9453125      last_line =  a = [1] * (10 ** 6)
line_no = 37    mem = 68.16796875       ,increment = 7.71484375     last_line =  df = pd.DataFrame({"val": a})
line_no = 39    mem = 75.80078125       ,increment = 7.6328125      last_line =  df2 = pd.DataFrame({"val": a})
line_no = 26    mem = 75.80078125       ,increment = 0.0        last_line =  def func2(df):
line_no = 28    mem = 83.43359375       ,increment = 7.6328125      last_line =  b = df.copy()
line_no = 43    mem = 83.43359375       ,increment = 0.0        last_line =  df5 = df.copy(deep=False)
line_no = 45    mem = 91.06640625       ,increment = 7.6328125      last_line =  df6 = df.copy(deep=True)
line_no = 47    mem = 91.25390625       ,increment = 0.1875     last_line =  df6.loc[:,"val"] = 2
line_no = 49    mem = 98.890625     ,increment = 7.63671875     last_line =  df7 = deepcopy(df)

A minor remark: executing the same function twice can show different memory consumption, because a library might load some extra resources the first time you execute one of its functions (the library is probably lazy-loading parts of itself).
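A minimal sketch of that effect (rss_mib and build are hypothetical helpers, and the exact numbers will vary by platform): calling the same constructor twice and keeping both results should show roughly the same data increment, plus any one-time lazy imports on the first call.

import os, psutil
import pandas as pd

process = psutil.Process(os.getpid())

def rss_mib():
    # current resident set size, in MiB
    return process.memory_info().rss / (1024 ** 2)

def build():
    return pd.DataFrame({"val": [1] * (10 ** 6)})

before = rss_mib()
df_a = build()
print(f"first call:  +{rss_mib() - before:.2f} MiB")  # data + any one-time lazy imports

before = rss_mib()
df_b = build()
print(f"second call: +{rss_mib() - before:.2f} MiB")  # data only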

Alternatively, commercial profilers may be able to catch momentary spikes, but I am only describing a copy-paste workaround here, as monitoring those spikes requires another process to be watching your Python process.
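For completeness, here is a rough sketch of such a monitor using a background thread instead of a full separate process (peak_memory is a hypothetical helper; a thread can still miss spikes while the GIL is held, which is why a dedicated monitoring process is more reliable):

import os
import threading
import time
import psutil

def peak_memory(func, interval=0.001):
    # run func() while a background thread samples this process's RSS;
    # returns (result, peak RSS in MiB observed during the call)
    proc = psutil.Process(os.getpid())
    peak = [proc.memory_info().rss]
    done = threading.Event()

    def sampler():
        while not done.is_set():
            peak[0] = max(peak[0], proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        result = func()
    finally:
        done.set()
        t.join()
    return result, peak[0] / (1024 ** 2)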

Upvotes: 2
