numpy.matrix ignores copy

Question

I have a column in a pandas DataFrame that holds numpy arrays (I know, probably not the best idea, but I'm curious now). Calling numpy.matrix on a copy of this column changes the original DataFrame:

import numpy as np
import pandas as pd

array = [np.array([1, 2])]

df = pd.DataFrame({
    'arrays': array.copy()  # creating a copy here...
})
df_backup = df.copy(deep=True)  # ... and here
df

This returns what I would expect:

   arrays
0  [1, 2]

A few things for comparison later:

>>> array
[array([1, 2])]
>>> array[0].shape
(2,)

Now I try converting this to a matrix. The result isn't what I wanted, but the point is that, as far as I understand, it changes data it shouldn't touch:

>>> np.matrix(df.arrays.copy(), copy=True)  # another copy
matrix([[array([[1],
       [2]])]], dtype=object)

This is where things get weird:

>>> df
       arrays
0  [[1], [2]]

So somehow my cell now holds an array of shape (2, 1), where each row is an array with just one number, whereas before it was a single array with two numbers. This happened even though I passed copy=True to np.matrix and worked on a copy of my pandas Series, df.arrays.copy().

>>> df_backup
       arrays
0  [[1], [2]]

Even the backup that I made early on has changed, although I used a deep copy for it.

And this is the part that confuses me the most: my original list has changed too, even though I called .copy() on it as well.

>>> array
[array([[1],
       [2]])]
>>> array[0].shape
(2, 1)

So my question is: after all these copies, how is everything still linked, and what else would I have to do to truly leave the original data unchanged?

Edit:

So it seems the answer is that pandas only stores references to the numpy objects, because even with

from copy import deepcopy
df_backup = deepcopy(df)

df_backup still gets modified.
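A minimal sketch of why the copies stay linked, assuming the same setup as above. list.copy() is shallow, and the pandas documentation for DataFrame.copy notes that with deep=True the Python objects inside an object column are not copied recursively, only the references to them. The identity checks below make that sharing visible. The element assignment is only a stand-in for whatever np.matrix does to the array in place; it is not a reproduction of that behaviour.

```python
import numpy as np
import pandas as pd

array = [np.array([1, 2])]
df = pd.DataFrame({'arrays': array.copy()})  # shallow list copy: same ndarray inside
df_backup = df.copy(deep=True)               # copies the column, not the objects in it

# The list, the frame and the backup all refer to one and the same ndarray.
shared_with_df = df['arrays'].iloc[0] is array[0]
shared_with_backup = df_backup['arrays'].iloc[0] is array[0]

# Stand-in for an in-place mutation of that shared ndarray.
array[0][0] = 99
value_seen_in_backup = df_backup['arrays'].iloc[0][0]

print(shared_with_df, shared_with_backup, value_seen_in_backup)
```

If both checks print True, any operation that mutates the ndarray in place will show up in array, df and df_backup alike, no matter how many container-level copies were made.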

The only way to keep array from being modified is to do something like

array_backup = deepcopy(array)

at the very beginning.
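One way to get a backup of the DataFrame that is really independent, sketched under the assumption that every cell of the column is an ndarray: copy the frame, then replace each cell with its own copy of the array. The list comprehension is my own workaround, not a built-in pandas option, and the element assignment again only stands in for an in-place change.

```python
from copy import deepcopy

import numpy as np
import pandas as pd

array = [np.array([1, 2])]
array_backup = deepcopy(array)  # recursive copy: new list holding a new ndarray

df = pd.DataFrame({'arrays': array})

# Copy the frame, then give every cell its own ndarray.
df_backup = df.copy(deep=True)
df_backup['arrays'] = [a.copy() for a in df['arrays']]

array[0][0] = 99  # in-place change to the original ndarray

print(df['arrays'].iloc[0])         # shares the ndarray with `array`
print(df_backup['arrays'].iloc[0])  # independent copy
print(array_backup[0])              # independent copy
```

Here the change reaches df, because it still shares the ndarray with array, but not df_backup or array_backup.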
