James Gessel

Reputation: 39

Pandas: function is overwriting original DF even though I am manipulating a copy?

I am creating a function to categorize data in bins in a df. I have made the function, and am first extracting numbers from a string, and replacing the column of text with a column of numbers.

The function is somehow overwriting the original dataframe, despite me only manipulating a copy of it.

def categorizeColumns(df):

    newdf = df
 
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)

The original df has a column of runtimes like this: [34 mins, 32 mins, 44 mins], etc., and the newdf should have [34, 32, 44], which it does. However, the original df also changes outside the function.

What's going on here? Any fixes? Thanks in advance.

EDIT: Seems like I wasn't making a copy, I needed to do

df.copy()

Thank you all!

Upvotes: 3

Views: 1398

Answers (2)

Satrio Adi Prabowo

Reputation: 600

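I think you are not actually making a copy of the dataframe. The assignment newdf = df does not copy anything; it makes newdf a second reference to the same object, so any change made through newdf also shows up in df.

You have to call .copy() on your dataframe.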

def categorizeColumns(df):
    newdf = df.copy()
 
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)
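
As a side note, the row-by-row loop can probably be replaced with a vectorized string operation. The following is only a sketch under some assumptions: the function name categorize_columns and the sample data are made up for illustration, and every Runtime value is assumed to contain at least one number (otherwise the cast to int would fail). It also takes the first run of digits in each string, which is slightly different from splitting on whitespace as the loop does.

```python
import pandas as pd


def categorize_columns(df):
    # Work on an independent copy so the caller's DataFrame stays untouched.
    newdf = df.copy()

    if 'Runtime' in newdf.columns:
        # Extract the first run of digits from each string ('34 mins' -> '34'),
        # then convert the result to integers.
        newdf['Runtime'] = (
            newdf['Runtime'].str.extract(r'(\d+)', expand=False).astype(int)
        )
    return newdf


# Hypothetical sample data standing in for moviesSeenRated.csv.
df = pd.DataFrame({'Runtime': ['34 mins', '32 mins', '44 mins']})
newdf = categorize_columns(df)
```

Here newdf['Runtime'] holds integers while df['Runtime'] still holds the original strings.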

Upvotes: 2

izhang05

Reputation: 764

The problem is that you aren't actually making a copy of the dataframe in the line newdf = df. To make a copy, you could do newdf = df.copy().
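
To see the difference for yourself, here is a small sketch with made-up data (the names alias and independent are just for illustration). It compares plain assignment with .copy() by writing through each one and checking what happens to the original.

```python
import pandas as pd

df = pd.DataFrame({'Runtime': ['34 mins', '32 mins']})

alias = df                # plain assignment: a second name for the same object
independent = df.copy()   # .copy(): a new DataFrame with its own data

# Write through each name and see which write reaches the original df.
alias.loc[0, 'Runtime'] = 'changed via alias'
independent.loc[1, 'Runtime'] = 'changed via copy'

same_object = alias is df                  # True: both names point to one DataFrame
copy_is_same_object = independent is df    # False: the copy is a separate object
```

After running this, the first row of df has changed (because alias is df), while the second row of df is still '32 mins'.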

Upvotes: 2
