Reputation: 114
We have a project with multiple *.py scripts containing functions that receive and return pandas DataFrame variables as arguments.
This makes me wonder: what is the behavior in memory of a DataFrame variable when it is passed as an argument to, or returned from, those functions?
Does modifying the df variable alter the parent/main/global variable as well?
Consider the following example:
import pandas as pd

def add_Col(df):
    df["New Column"] = 10 * 3

def mod_Col(df):
    df["Existing Column"] = df["Existing Column"] ** 2

data = [0, 1, 2, 3]
df = pd.DataFrame(data, columns=["Existing Column"])
add_Col(df)
mod_Col(df)
df
When df is displayed at the end, will the new column show up? What about the change made to "Existing Column" when calling mod_Col? Did invoking add_Col create a copy of df, or only pass a reference?
What is the best practice when passing DataFrames into functions? If they are large enough, I am sure creating copies will have both performance and memory implications, right?
Upvotes: 4
Views: 1541
Reputation: 59579
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations return a new object, so those modifications do not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but multiplying the entire DataFrame (which returns a new object) leaves the original unchanged.
If a function combines both kinds of operations, it can modify your DataFrame in place up to the point where an operation returns a new object.
Changes the original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df.loc[1, 0] = 7
mutate_data(df)
print(df)
# 0
#0 1
#1 7
#2 4
Will not change original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df = df * 2
mutate_data(df)
print(df)
# 0
#0 1
#1 2
#2 4
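To see both behaviors interact, here is a minimal sketch of a function that mixes them (the function name is illustrative): the in-place change made before the reassignment reaches the caller, while everything after the reassignment only touches the new local object.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 4])

def mixed_changes(df):
    df.loc[1, 0] = 7    # in-place: visible to the caller
    df = df * 2         # rebinds the local name to a new object
    df.loc[2, 0] = 99   # modifies only the new local object

mixed_changes(df)
print(df)
#   0
#0  1
#1  7
#2  4
```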
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
    df['new_column'] = 7
    return df
df = add_column(df)
#┃ ┃
#┗ on lhs & rhs ┛
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, this may have unintended consequences if you plan to write to a new object:
df1 = add_column(df)
# | |
# New Obj Function still modifies this though!
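A quick check (reusing the add_column function from above) makes that side effect visible: the "new" object is the same object as the original.

```python
import pandas as pd

def add_column(df):
    df['new_column'] = 7
    return df

df = pd.DataFrame([1, 2, 4])
df1 = add_column(df)

print('new_column' in df.columns)  # True: the original was mutated
print(df1 is df)                   # True: both names point to one object
```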
A safe alternative that requires no knowledge of the underlying source code is to force your function to copy at the top. Then, within that scope, changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
    df = df.copy()
    df['new_column'] = 7
    return df
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
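Either way, the function only ever sees the copy, so the caller's DataFrame stays untouched (again reusing the add_column function from above):

```python
import pandas as pd

def add_column(df):
    df['new_column'] = 7
    return df

df = pd.DataFrame([1, 2, 4])
df1 = add_column(df.copy())  # the function mutates only the copy

print('new_column' in df.columns)   # False: original unchanged
print('new_column' in df1.columns)  # True: returned copy has the column
```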
Upvotes: 4
Reputation: 121
Yes, the function will indeed change the DataFrame itself without creating a copy of it. You should be careful, because you might end up with columns changed without noticing.
In my opinion the best practice depends on the use case, and using .copy() will indeed have an impact on your memory.
If, for instance, you are building a pipeline that takes some DataFrame as input, you do not want to change the input DataFrame itself. But if you are just processing a DataFrame and splitting the processing across different functions, you can write the functions the way you did.
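One way to get both behaviors, sketched below with illustrative function names: copy once at the pipeline boundary, then let the internal helpers mutate freely without further copies.

```python
import pandas as pd

def square(df):
    df['x'] = df['x'] ** 2      # mutates in place

def add_flag(df):
    df['big'] = df['x'] > 4     # mutates in place

def pipeline(df):
    df = df.copy()              # one copy at the boundary protects the caller
    square(df)
    add_flag(df)
    return df

raw = pd.DataFrame({'x': [1, 2, 3]})
result = pipeline(raw)
print(raw)     # unchanged input
print(result)  # squared 'x' plus the 'big' flag
```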
Upvotes: 3