Barrendeitor

Reputation: 543

Why is a copy of a pandas object altering one column on the original object? (Slice copy)

As I understand it, copying by slicing copies the top level of a structure but not the nested levels (though I'm not sure exactly when).

However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.

How is this possible? Why is one column affected, rather than both or neither?

Here is the code:

import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)

# Copy data
processed_data = data[:]
print(data[:10])

# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0

# gpa column has changed
print(data[:10])

On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.

Here is the original data after the edit: [image: recreation]

Upvotes: 0

Views: 220

Answers (1)

user2285236


"As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones."

This is true for Python lists: slicing creates a shallow copy.

In [44]: lst = [[1, 2], 3, 4]                                                      

In [45]: lst2 = lst[:]                                                             

In [46]: lst2[1] = 100                                                             

In [47]: lst  # unchanged                                                          
Out[47]: [[1, 2], 3, 4]

In [48]: lst2[0].append(3)                                                         

In [49]: lst  # changed                                                            
Out[49]: [[1, 2, 3], 3, 4]
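As a rough sketch of the same point, the snippet below contrasts the slice with copy.deepcopy from the standard library. The variable names (shallow, deep and the two flags) are only illustrative and are not from the original session.

```python
import copy

lst = [[1, 2], 3, 4]
shallow = lst[:]             # new outer list, but the inner list is the same object
deep = copy.deepcopy(lst)    # new outer list and a new inner list

# Identity checks show which copy still shares the nested list with lst.
shallow_shares_inner = shallow[0] is lst[0]
deep_shares_inner = deep[0] is lst[0]

deep[0].append(3)            # only the deep copy's inner list grows
print(lst, shallow_shares_inner, deep_shares_inner)
```

Appending to deep[0] leaves lst alone, whereas the same append on shallow[0] would show up in lst, as in the session above.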

However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array, so writing through the slice also changes the original.

In [50]: arr = np.array([1, 2, 3])                                                 

In [51]: arr2 = arr[:]                                                             

In [52]: arr2[0] = 100                                                             

In [53]: arr                                                                       
Out[53]: array([100,   2,   3])
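If you want to check whether two arrays share a buffer instead of inferring it from a side effect, something like the following should work. It is a minimal sketch using np.shares_memory and the .base attribute; the names view and copied are just for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3])
view = arr[:]          # basic slicing: a view onto arr's buffer
copied = arr.copy()    # explicit copy: owns a separate buffer

# shares_memory reports whether the two arrays overlap in memory.
view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copied)

# A view created by slicing keeps a reference to the array it came from.
view_base_is_arr = view.base is arr

copied[0] = 100        # writes to the copy only; arr is untouched
print(view_shares, copy_shares, view_base_is_arr, arr)
```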

If you have a DataFrame with a single dtype, the behaviour you see is the same:

In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])                                 

In [63]: df                                                                        
Out[63]: 
   0  1  2
0  1  2  3
1  4  5  6

In [64]: df2 = df[:]                                                               

In [65]: df2.iloc[0, 0] = 100                                                      

In [66]: df                                                                        
Out[66]: 
     0  1  2
0  100  2  3
1    4  5  6

But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning. The pandas documentation explains it like this:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!

In your case, my guess is that this results from how pandas handles different dtypes. Each dtype has its own block, and in the case of the gpa column the block is the column itself. This is not the case for gre, because you have other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to 2, changing gre in processed_data no longer affects the original data.
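To see which columns would end up grouped together by dtype, you can list them with the public dtypes attribute. The frame below is a hypothetical stand-in for the CSV: the column names other than gre and gpa, and all of the values, are assumptions for illustration, and the grouping only approximates the internal block layout, which is not part of the public API.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; names and values are assumptions.
data = pd.DataFrame({
    "admit": [0, 1, 1],
    "gre": [380, 660, 800],
    "gpa": [3.61, 3.67, 4.0],
    "rank": [3, 3, 1],
})

# Collect column names per dtype. Columns of the same dtype are the ones
# pandas may store together in a single block.
columns_by_dtype = {}
for name, dtype in data.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)

print(columns_by_dtype)
```

In this stand-in, gpa is alone in the float group while gre shares the integer group with two other columns, which matches the asymmetry described above.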

In a nutshell, the behavior is the result of an implementation detail that you shouldn't rely on. If you want to copy a DataFrame, use .copy() explicitly; if you want to modify parts of a DataFrame, don't assign those parts to other variables. Modify them directly with .loc or .iloc.
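A minimal sketch of both recommendations, using a small hypothetical frame in place of the CSV (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the CSV; values are assumptions.
data = pd.DataFrame({
    "gre": [380, 660, 800],
    "gpa": [3.61, 3.67, 4.0],
})

# Explicit copy: processed_data owns its data, so edits never reach data.
processed_data = data.copy()
processed_data["gre"] = processed_data["gre"] / 800.0
processed_data["gpa"] = processed_data["gpa"] / 4.0

# To change the original on purpose, index it directly with .loc
# instead of going through an intermediate variable.
data.loc[0, "gre"] = 400

print(data)
print(processed_data)
```

With .copy(), the scaling of gre and gpa stays in processed_data, and the only change to data is the one made through .loc.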

Upvotes: 1
