Reputation: 39
I am writing a function to categorize the data in a DataFrame into bins. As a first step, it extracts the numbers from a column of strings and replaces the text column with a column of numbers.
The function is somehow overwriting the original DataFrame, even though I thought I was only manipulating a copy of it.
def categorizeColumns(df):
    newdf = df
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)
The original df has a column of runtimes like [34 mins, 32 mins, 44 mins], and newdf should have [34, 32, 44], which it does. However, the original df also changes outside the function.
What's going on here? Any fixes? Thanks in advance.
EDIT: It seems I wasn't making a copy. I needed to use df.copy(). Thank you all!
Upvotes: 3
Views: 1398
Reputation: 600
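I think you are not making a copy of the DataFrame. The assignment newdf = df does not copy anything; it creates a second reference to the same object, so any change made through newdf also shows up in df.
You have to call .copy() on your DataFrame to get an independent object.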
import pandas as pd

def categorizeColumns(df):
    # Work on an independent copy so the caller's DataFrame is left untouched.
    newdf = df.copy()
    if 'Runtime' in newdf.columns:
        for row in range(len(newdf['Runtime'])):
            strRuntime = newdf['Runtime'][row]
            numsRuntime = [int(i) for i in strRuntime.split() if i.isdigit()]
            newdf.loc[row,'Runtime'] = numsRuntime[0]
    return newdf

df = pd.read_csv('moviesSeenRated.csv')
newdf = categorizeColumns(df)
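To see the difference between a reference and a copy in isolation, here is a minimal sketch. The small DataFrame and the names alias and clone are made up for illustration and stand in for the CSV from the question.

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV.
df = pd.DataFrame({"Runtime": ["34 mins", "32 mins", "44 mins"]})

alias = df         # plain assignment: a second name for the same object
clone = df.copy()  # independent copy (deep by default)

alias.loc[0, "Runtime"] = "99 mins"  # also visible through df
clone.loc[1, "Runtime"] = "0 mins"   # df is not affected

print(alias is df)  # the two names refer to one object
print(clone is df)  # the copy is a separate object
print(df)
```

After running this, row 0 of df has changed because alias and df are the same object, while row 1 of df still holds its original value because the edit went to the copy.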
Upvotes: 2
Reputation: 764
The problem is that you aren't actually making a copy of the dataframe in the line newdf = df. To make a copy, you could do newdf = df.copy().
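As a side note, once the copy is in place the row loop could probably be replaced by a vectorized string operation. This is only a sketch: the name categorize_columns and the sample data are hypothetical, it assumes every Runtime value contains at least one number, and it takes the first run of digits rather than the first all-digit token, so it would treat a value like "34mins" differently from the original loop.

```python
import pandas as pd

def categorize_columns(df):
    """Return a copy of df with 'Runtime' converted from text to integers."""
    newdf = df.copy()  # leave the caller's DataFrame untouched
    if "Runtime" in newdf.columns:
        # Pull the first run of digits out of each string, e.g. "34 mins" -> "34".
        digits = newdf["Runtime"].str.extract(r"(\d+)", expand=False)
        newdf["Runtime"] = digits.astype(int)
    return newdf

# Hypothetical stand-in for the data read from the CSV.
df = pd.DataFrame({"Runtime": ["34 mins", "32 mins", "44 mins"]})
newdf = categorize_columns(df)

print(newdf["Runtime"].tolist())
print(df["Runtime"].tolist())
```

Printing both columns makes it easy to confirm that newdf holds integers while df still holds the original strings.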
Upvotes: 2