ZK Zhao
ZK Zhao

Reputation: 21573

Python & Pandas: Will using many df.copy affect code's performance?

I'm doing some data analysis, and the data are in pandas DataFrame, df.

There are several function that I defined to do process on the df.

For encapsulation purpose, I define the functions like this:

def df_process(df):
    df=df.copy()
    # do some process work on df
    return df

In Jupyter Notebook, I use the function as

df = df_process(df)

The reason for using df.copy() is that otherwise the original df would be modified, whether you assign it back or not. (see Python & Pandas: How to return a copy of a dataframe?)

My question is:

  1. Is using df=df.copy() proper here? If not, how should a function handle data be defined?

  2. Since I use several such data processing function, will it affect my program's performance? And how much?

Upvotes: 2

Views: 1097

Answers (1)

msw
msw

Reputation: 43507

Far better would be:

def df_process(df):
    # do some process work on df

def df_another(df):
    # other processing

def df_more(df):
    # yet more processing

def process_many(df):
    for frame_function in (df_process, df_another, df_more):
        df_copy = df.copy()
        frame_function(df_copy)
        # emit the results to a file or screen or whatever

The key here is if you must make a copy, make only one, process it, stash the results somewhere and then dispose of it by reassigning df_copy. Your question made no mention of why you are hanging onto processed copies so this assumes you need not.

Upvotes: 1

Related Questions