As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure when).
However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.
How is this possible? Why is only one column affected, rather than both or neither?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy(), it works fine.
Here is the original data after the edit:
Upvotes: 0
Views: 220
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. For the most part, numpy returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
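The "for the most part" qualifier matters: basic slicing gives a view, while indexing with a list of positions (fancy indexing) gives a new array. Here is a minimal sketch of how you might check which one you got, using np.shares_memory (the variable names are only illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3])

view = arr[:]            # basic slicing: a view onto the same buffer
fancy = arr[[0, 1, 2]]   # indexing with a list: a new array with its own data

slice_shares = np.shares_memory(arr, view)
fancy_shares = np.shares_memory(arr, fancy)

view[0] = 100            # writes through to arr
fancy[1] = 200           # does not touch arr

print(slice_shares, fancy_shares)
print(arr)
```

Only the write through the slice shows up in arr; the write to the fancy-indexed result stays local to that array.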
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
   0  1  2
0  1  2  3
1  4  5  6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
     0  1  2
0  100  2  3
1    4  5  6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning. As the pandas documentation puts it:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!
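One way to sidestep that uncertainty is to do the selection and the assignment in a single .loc call, so there is no intermediate object at all. A small sketch, where the dfmi frame is a made-up stand-in for the one in the quoted snippet:

```python
import pandas as pd

# Hypothetical frame with two-level column labels, standing in for dfmi
columns = pd.MultiIndex.from_tuples(
    [("one", "first"), ("one", "second"), ("two", "first")]
)
dfmi = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# One indexing operation on dfmi itself, with no temporary in between
dfmi.loc[:, ("one", "second")] = 0

print(dfmi[("one", "second")].tolist())
```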
In your case, my guess is that this is the result of how pandas handles different dtypes. Each dtype has its own block, and in the case of the gpa column the block is the column itself. This is not the case for gre, since you have other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to 2, changing gpa in processed_data no longer affects the original data.
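To see which columns share a dtype, and so could plausibly be stored together, you can group the column names by dtype using only the public dtypes attribute. This is just a sketch: the frame below is a tiny stand-in with made-up values (the admit and rank column names are assumptions, not taken from the question), and the code does not inspect how pandas actually lays out its blocks internally.

```python
from collections import defaultdict

import pandas as pd

# Stand-in data: three integer columns and a single float column
data = pd.DataFrame({
    "admit": [0, 1, 1],
    "gre": [380, 660, 800],
    "gpa": [3.61, 3.67, 4.0],
    "rank": [3, 3, 1],
})

# Map each dtype name to the list of columns that have it
columns_by_dtype = defaultdict(list)
for name, dtype in data.dtypes.items():
    columns_by_dtype[str(dtype)].append(name)

print(dict(columns_by_dtype))
```

In this stand-in, gpa is the only float column while gre shares its dtype with two others, which matches the situation guessed at above.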
In a nutshell, the behavior is the result of an implementation detail which you shouldn't rely on. If you want to copy DataFrames, you should explicitly use .copy(), and if you want to modify parts of DataFrames, you shouldn't assign those parts to other variables. You should modify them directly with either .loc or .iloc.
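As a rough sketch of both recommendations, with a small made-up frame standing in for the question's data (the values and the 700 cap are purely illustrative):

```python
import pandas as pd

# Made-up stand-in for the question's data
data = pd.DataFrame({"gre": [380, 660, 800], "gpa": [3.61, 3.67, 4.0]})
original_gpa = data["gpa"].tolist()

# Explicit, independent copy: edits to it cannot reach data
processed_data = data.copy()
processed_data["gre"] = processed_data["gre"] / 800.0
processed_data["gpa"] = processed_data["gpa"] / 4.0

# The original is untouched by the edits above
original_unchanged = data["gpa"].tolist() == original_gpa

# Modifying part of a frame: one .loc call on the frame itself
data.loc[data["gre"] > 700, "gre"] = 700

print(original_unchanged)
print(data["gre"].tolist())
```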
Upvotes: 1