ts91

Reputation: 33

numpy.matrix ignores copy

I have a column in a pandas DataFrame that holds NumPy arrays (I know, probably not the best idea, but I'm curious now). Calling numpy.matrix on a copy of this column changes the original DataFrame:

import numpy as np
import pandas as pd

array = [np.array([1, 2])]

df = pd.DataFrame({
    'arrays': array.copy()  # creating a copy here...
})
df_backup = df.copy(deep=True)  # ... and here
df

This returns what I would expect:

   arrays
0  [1, 2]

A few things for comparison later:

>>> array
[array([1, 2])]
>>> array[0].shape
(2,)

Now I try converting this to a matrix. The result isn't what I want, but my point stands: as far as I understand, it changes data that it shouldn't:

>>> np.matrix(df.arrays.copy(), copy=True)  # another copy
matrix([[array([[1],
       [2]])]], dtype=object)

This is where things get weird:

>>> df
       arrays
0  [[1], [2]]

So somehow my cell now holds a 2-D array of shape (2, 1), whereas before it was a 1-D array with two numbers. This happened even though I passed copy=True to np.matrix and worked on a copy of my pandas Series, df.arrays.copy().

>>> df_backup
       arrays
0  [[1], [2]]

Even the backup that I made early on has changed, although I used a deep copy for it.

And this is the part that confuses me the most: my original list has changed too, even though I called .copy() on it as well.

>>> array
[array([[1],
       [2]])]
>>> array[0].shape
(2, 1)

So my question is: after all these copies, how is everything still linked, and what else would I have to do to leave the original data truly unchanged?
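One way to see how the copies are linked is to compare object identities instead of values. This is only a sketch using the same setup as above; the names inner and same_in_* are made up for illustration. list.copy() is shallow, and the pandas documentation for copy(deep=True) notes that Python objects stored in an object column are not copied recursively, so each check is expected to find the very same ndarray object.

```python
import numpy as np
import pandas as pd

array = [np.array([1, 2])]
df = pd.DataFrame({'arrays': array.copy()})   # shallow copy of the list
df_backup = df.copy(deep=True)                # copies the column, not the ndarray inside

inner = df['arrays'].iloc[0]                  # the ndarray stored in the cell

# "is" compares identity: True means both names refer to one ndarray object
same_in_list_copy = array.copy()[0] is array[0]
same_in_series_copy = df['arrays'].copy().iloc[0] is inner
same_in_frame_backup = df_backup['arrays'].iloc[0] is inner
print(same_in_list_copy, same_in_series_copy, same_in_frame_backup)
```

If all three flags are True, any in-place change to that one ndarray shows up through every container, no matter how many times the containers themselves were copied.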

Edit:

So it seems the answer is that pandas only stores a reference to the NumPy object, since even with

from copy import deepcopy
df_backup = deepcopy(df)

df_backup still gets modified.

The only way to keep array from being modified is to do something like

array_backup = deepcopy(array)

at the very beginning.
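For the DataFrame backup itself, one option is to copy every stored array explicitly. This is a hedged sketch rather than an official pandas recipe: the list comprehension and the in-place shape assignment (used here as a stand-in for whatever np.matrix did to the array) are my own assumptions for illustration.

```python
from copy import deepcopy

import numpy as np
import pandas as pd

array = [np.array([1, 2])]
df = pd.DataFrame({'arrays': deepcopy(array)})   # df gets its own ndarray

# Build the backup from per-element copies so it shares no ndarray with df
df_backup = pd.DataFrame(
    {'arrays': [a.copy() for a in df['arrays']]},
    index=df.index,
)

# Stand-in for the unwanted side effect: reshape df's ndarray in place
df['arrays'].iloc[0].shape = (2, 1)

print(df['arrays'].iloc[0].shape)         # the mutated array
print(df_backup['arrays'].iloc[0].shape)  # the independent backup
print(array[0].shape)                     # the original list element
```

The idea is simply that each cell needs its own ndarray; copying the list, the Series, or the DataFrame only duplicates the references.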

Upvotes: 0

Views: 59

Answers (1)

hpaulj

Reputation: 231415

First, a list containing an array:

In [334]: alist = [np.array([1,2])]

A dataframe from that list:

In [335]: df = pd.DataFrame({'arrays':alist})
In [336]: df
Out[336]: 
   arrays
0  [1, 2]

A pd Series:

In [337]: df.arrays
Out[337]: 
0    [1, 2]
Name: arrays, dtype: object

An element of that Series:

In [338]: df.arrays[0]
Out[338]: array([1, 2])

Make a matrix from that array. It's a copy (copy=True is the default):

In [339]: mat = np.matrix(df.arrays[0])
In [340]: mat
Out[340]: matrix([[1, 2]])
In [341]: df
Out[341]: 
   arrays
0  [1, 2]
In [342]: alist
Out[342]: [array([1, 2])]

Make a matrix from the Series:

In [343]: mat2 = np.matrix(df.arrays)
In [344]: mat2
Out[344]: 
matrix([[array([[1],
       [2]])]], dtype=object)
In [345]: alist
Out[345]: 
[array([[1],
        [2]])]
In [346]: mat2.shape
Out[346]: (1, 1)

mat2 is a (1,1) matrix (a matrix is always 2-D) with object dtype; that is, it contains a single object, in this case an array.

Creating mat2 appears to have replaced the element in alist with a (2,1) array, and df points to that same array. (Edit: digging further, it appears that creating mat2 just reshaped the array in alist rather than replacing it.)

I'm not sure what produced this (2,1) array, but I suspect it has something to do with how the Series passes its elements to np.matrix. In any case, you don't want to make a matrix directly from a Series; make it from an element of the Series instead.
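To illustrate the difference between replacing an element and reshaping it, here is a small NumPy-only sketch. It does not reproduce what np.matrix does internally (I haven't traced that); it only shows that assigning to .shape changes an array in place, so every container holding a reference sees the new shape. The names holder and reshaped are made up for the example.

```python
import numpy as np

a = np.array([1, 2])
alist = [a]                          # the list holds a reference to a
holder = np.empty(1, dtype=object)
holder[0] = a                        # an object array holds a reference too

reshaped = a.reshape(2, 1)           # returns a new array object
shape_after_reshape = alist[0].shape # a itself is untouched so far

a.shape = (2, 1)                     # changes a in place
shape_after_assignment = alist[0].shape

print(shape_after_reshape, shape_after_assignment, holder[0].shape)
# (2,) (2, 1) (2, 1)
```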
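If the goal is one numeric matrix built from all the arrays in the column, a possible approach is to stack the elements into a new 2-D array first. This is a sketch under the assumption that every array in the column has the same length; the two-row frame is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'arrays': [np.array([1, 2]), np.array([3, 4])]})

# np.vstack allocates a fresh 2-D array from the row arrays
stacked = np.vstack(df['arrays'].to_list())
mat = np.matrix(stacked)             # copy=True is the default

mat[0, 0] = 99                       # modify the matrix only

print(mat.shape)
print(df['arrays'].iloc[0])          # the stored array is unaffected
```

Because the stacking step creates new memory, later changes to mat or stacked should not reach the arrays stored in the DataFrame.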

Upvotes: 0
