Reputation: 21573
I'm doing some data analysis, and the data are in pandas DataFrame
, df
.
There are several function that I defined to do process on the df
.
For encapsulation purpose, I define the functions like this:
def df_process(df):
df=df.copy()
# do some process work on df
return df
In Jupyter Notebook, I use the function as
df = df_process(df)
The reason for using df.copy()
is that otherwise the original df
would be modified, whether you assign it back or not. (see Python & Pandas: How to return a copy of a dataframe?)
My question is:
Is using df=df.copy()
proper here? If not, how should a function handle data be defined?
Since I use several such data processing function, will it affect my program's performance? And how much?
Upvotes: 2
Views: 1097
Reputation: 43507
Far better would be:
def df_process(df):
# do some process work on df
def df_another(df):
# other processing
def df_more(df):
# yet more processing
def process_many(df):
for frame_function in (df_process, df_another, df_more):
df_copy = df.copy()
frame_function(df_copy)
# emit the results to a file or screen or whatever
The key here is if you must make a copy, make only one, process it, stash the results somewhere and then dispose of it by reassigning df_copy
. Your question made no mention of why you are hanging onto processed copies so this assumes you need not.
Upvotes: 1