Reputation: 114
We have a project with multiple *.py scripts containing functions that receive and return pandas DataFrame variables as arguments.
This makes me wonder: what is the behavior in memory of a DataFrame variable when it is passed as an argument to, or returned from, those functions?
Does modifying the df variable alter the parent/main/global variable as well?
Consider the following example:
import pandas as pd

def add_Col(df):
    df["New Column"] = 10 * 3

def mod_Col(df):
    df["Existing Column"] = df["Existing Column"] ** 2

data = [0, 1, 2, 3]
df = pd.DataFrame(data, columns=["Existing Column"])
add_Col(df)
mod_Col(df)
df
When df is displayed at the end, will the new column show up? What about the change made to "Existing Column" when calling mod_Col? Did invoking add_Col create a copy of df, or only pass a reference?
What is the best practice when passing DataFrames into functions? If they are large enough, I am sure creating copies will have both performance and memory implications, right?
Upvotes: 4
Views: 1541
Reputation: 59579
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations return a new object, so those modifications do not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but multiplying the entire DataFrame (which returns a new object) leaves the original unchanged.
If a function combines both kinds of operations, it can modify your DataFrame in place up to the point where an operation returns a new object.
Changes the original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df.loc[1, 0] = 7
mutate_data(df)
print(df)
# 0
#0 1
#1 7
#2 4
Will not change original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df = df * 2
mutate_data(df)
print(df)
# 0
#0 1
#1 2
#2 4
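To see both behaviors interact, here is a minimal sketch of a function that mixes them (the function name is illustrative): the in-place change made before the reassignment reaches the caller, while everything after the reassignment only touches the new local object.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 4])

def mixed_changes(df):
    df.loc[1, 0] = 7    # in-place: visible to the caller
    df = df * 2         # rebinds the local name to a new object
    df.loc[2, 0] = 99   # modifies only the new local object

mixed_changes(df)
print(df)
#   0
#0  1
#1  7
#2  4
```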
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
    df['new_column'] = 7
    return df
df = add_column(df)
#┃ ┃
#┗ on lhs & rhs ┛
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, this may have unintended consequences if you plan to write to a new object:
df1 = add_column(df)
# | |
# New Obj Function still modifies this though!
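A quick check (reusing the add_column function from above) makes that side effect visible: the "new" object is the same object as the original.

```python
import pandas as pd

def add_column(df):
    df['new_column'] = 7
    return df

df = pd.DataFrame([1, 2, 4])
df1 = add_column(df)

print('new_column' in df.columns)  # True: the original was mutated
print(df1 is df)                   # True: both names point to one object
```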
A safe alternative that requires no knowledge of the underlying source code is to force your function to copy at the top. Then, within that scope, changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
    df = df.copy()
    df['new_column'] = 7
    return df
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
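Either way, the function only ever sees the copy, so the caller's DataFrame stays untouched (again reusing the add_column function from above):

```python
import pandas as pd

def add_column(df):
    df['new_column'] = 7
    return df

df = pd.DataFrame([1, 2, 4])
df1 = add_column(df.copy())  # the function mutates only the copy

print('new_column' in df.columns)   # False: original unchanged
print('new_column' in df1.columns)  # True: returned copy has the column
```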
Upvotes: 4
Reputation: 121
Yes, the function will indeed change the DataFrame itself without creating a copy of it. You should be careful, because you might end up with columns changed without noticing.
In my opinion the best practice depends on the use case, and using .copy() will indeed have an impact on your memory.
If, for instance, you are building a pipeline that takes some DataFrame as input, you do not want to change the input DataFrame itself. But if you are just processing a DataFrame and splitting the processing across different functions, you can write the functions the way you did.
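One way to get both behaviors, sketched below with illustrative function names: copy once at the pipeline boundary, then let the internal helpers mutate freely without further copies.

```python
import pandas as pd

def square(df):
    df['x'] = df['x'] ** 2      # mutates in place

def add_flag(df):
    df['big'] = df['x'] > 4     # mutates in place

def pipeline(df):
    df = df.copy()              # one copy at the boundary protects the caller
    square(df)
    add_flag(df)
    return df

raw = pd.DataFrame({'x': [1, 2, 3]})
result = pipeline(raw)
print(raw)     # unchanged input
print(result)  # squared 'x' plus the 'big' flag
```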
Upvotes: 3