Reputation: 33
I have a column in a pandas dataframe that itself holds numpy arrays (I know, probably not the best idea, but I'm curious now). Calling numpy.matrix on a copy of this column changes the original dataframe:
import numpy as np
import pandas as pd
array = [np.array([1, 2])]
df = pd.DataFrame({
'arrays': array.copy() # creating a copy here...
})
df_backup = df.copy(deep=True) # ... and here
df
This returns what I would expect:
arrays
0 [1, 2]
A few things for comparison later:
>>> array
[array([1, 2])]
>>> array[0].shape
(2,)
Now I try converting this to a matrix. The result isn't what I want, but my point stands: as far as I understand, it changes data that it shouldn't:
>>> np.matrix(df.arrays.copy(), copy=True) # another copy
matrix([[array([[1],
[2]])]], dtype=object)
This is where things get weird:
>>> df
arrays
0 [[1], [2]]
So somehow my cell now holds an array in which each element is an array with just one number, whereas before it was a single array with two numbers. This happened even though I called np.matrix(..., copy=True) and worked on a copy of my pandas Series: df.arrays.copy().
>>> df_backup
arrays
0 [[1], [2]]
Even the backup that I made early on has changed. I even used deep copy for that one.
And this is the part that confuses me the most: my original list has changed too, even though I called .copy() on that as well.
>>> array
[array([[1],
[2]])]
>>> array[0].shape
(2, 1)
So now my question is, after all these copies, how is everything still linked and what else would I have to do to truly not change the original data?
Edit:
So it seems the answer is that pandas only stores a reference to the numpy object, because even a backup made with
from copy import deepcopy
df_backup = deepcopy(df)
df_backup
still gets modified.
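A small check of that explanation, as a sketch only: the pandas documentation for DataFrame.copy notes that with deep=True the Python objects stored in an object column are not copied recursively, only the references to them. The names inner and df_independent are illustrative, and copying each array by hand is just one possible way to get an independent backup.

```python
import numpy as np
import pandas as pd

inner = np.array([1, 2])
df = pd.DataFrame({'arrays': [inner]})

# deep=True copies the column's container, but the cell still refers
# to the very same ndarray object as the cell in the original frame
df_backup = df.copy(deep=True)
same_object = df_backup['arrays'].iloc[0] is df['arrays'].iloc[0]

# One possible way to get an independent backup: copy each array explicitly
df_independent = pd.DataFrame(
    {'arrays': [arr.copy() for arr in df['arrays']]},
    index=df.index,
)
shares_memory = np.shares_memory(df_independent['arrays'].iloc[0], inner)

print(same_object)    # True
print(shares_memory)  # False
```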
The only way to keep an unmodified version of array is to do something like
array_backup = deepcopy(array)
at the very beginning.
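To see why deepcopy helps where list.copy() does not, here is a minimal sketch. It uses a plain element assignment as a stand-in for the in-place change made by np.matrix, so it does not depend on that behaviour; the names shallow and deep are illustrative.

```python
from copy import deepcopy

import numpy as np

array = [np.array([1, 2])]
shallow = array.copy()    # new list, but the same ndarray object inside
deep = deepcopy(array)    # new list with its own ndarray inside

# Stand-in for any in-place modification of the original array
array[0][0] = 99

print(shallow[0] is array[0])   # True
print(shallow[0].tolist())      # [99, 2]
print(deep[0].tolist())         # [1, 2]
```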
Reputation: 231415
First, a list containing an array:
In [334]: alist = [np.array([1,2])]
A dataframe from that list:
In [335]: df = pd.DataFrame({'arrays':alist})
In [336]: df
Out[336]:
arrays
0 [1, 2]
A pandas Series:
In [337]: df.arrays
Out[337]:
0 [1, 2]
Name: arrays, dtype: object
An element of that Series:
In [338]: df.arrays[0]
Out[338]: array([1, 2])
Make a matrix from that array. It's a copy, since copy=True is the default:
In [339]: mat = np.matrix(df.arrays[0])
In [340]: mat
Out[340]: matrix([[1, 2]])
In [341]: df
Out[341]:
arrays
0 [1, 2]
In [342]: alist
Out[342]: [array([1, 2])]
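The copy behaviour in this step can be checked directly. This is a sketch with illustrative names (arr, mat, view); np.asmatrix is documented as equivalent to np.matrix(data, copy=False) and is included only for contrast.

```python
import numpy as np

arr = np.array([1, 2])

mat = np.matrix(arr)       # copy=True is the default
mat[0, 0] = 99             # changes only the matrix, not arr
view = np.asmatrix(arr)    # no copy: shares arr's data

print(arr.tolist())                 # [1, 2]
print(np.shares_memory(mat, arr))   # False
print(np.shares_memory(view, arr))  # True
print(arr.shape, mat.shape)         # (2,) (1, 2)
```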
Make a matrix from the Series:
In [343]: mat2 = np.matrix(df.arrays)
In [344]: mat2
Out[344]:
matrix([[array([[1],
[2]])]], dtype=object)
In [345]: alist
Out[345]:
[array([[1],
[2]])]
In [346]: mat2.shape
Out[346]: (1, 1)
mat2 is a (1,1) matrix (a matrix is always 2d) with object dtype; that is, it contains an object, in this case an array.
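An object-dtype container stores references to its elements, so copying the container does not copy what it points to. The sketch below builds such a (1, 1) object array by hand instead of going through np.matrix and the Series; the names inner and outer are illustrative.

```python
import numpy as np

inner = np.array([1, 2])

# A (1, 1) object-dtype array whose single cell refers to `inner`
outer = np.empty((1, 1), dtype=object)
outer[0, 0] = inner

outer_copy = outer.copy()   # copies the reference, not the ndarray itself

print(outer[0, 0] is inner)        # True
print(outer_copy[0, 0] is inner)   # True
print(outer.dtype, outer.shape)    # object (1, 1)
```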
Creating mat2 has replaced the element in alist with a (2,1) array. df also has a pointer to this new array. (Edit: digging further, it appears that creating mat2 just reshaped the array in alist.)
I'm not sure what created this (2,1) array, but I suspect it has something to do with how the Series passes its elements to np.matrix. In any case, you don't want to make a matrix directly from a Series. You make it from an element of the Series.
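A sketch of that recommendation, assuming a column whose cells are equal-length 1-D arrays. The two-row frame is made up for illustration, and np.vstack is only one possible way to combine the cells when a single 2-D array is what's wanted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'arrays': [np.array([1, 2]), np.array([3, 4])]})

# One matrix per cell: pass the ndarray itself, not the Series
first = np.matrix(df['arrays'].iloc[0])

# Or combine all cells into a single 2-D numeric array
stacked = np.vstack(df['arrays'].tolist())

print(first.shape)              # (1, 2)
print(stacked.shape)            # (2, 2)
print(stacked.dtype == object)  # False
```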