Roy

Reputation: 945

Python Pandas: how to pass "views" of DataFrames into function?

My scenario is that a function should be able to modify the values inside a pandas.DataFrame. But I don't want to expose the whole DataFrame to the function, just the parts that need to be modified. One reason for this is that the function becomes more general if the caller can specify from outside which portion of the DataFrame is to be modified. Imagine I can write a function mult(df_view, a) which multiplies all the values in the view by a. Notice that I don't want a new DataFrame to be created. The value change should be in-place.

This is my attempt:

import pandas as pd

df = pd.DataFrame([[1,1],[1,1]])

def mult(df_view, a):
    df_view *= a

mult(df.loc[1,1], 2)

print(df)

This is the (undesired) output:

   0  1
0  1  1
1  1  1

The expected output is:

   0  1
0  1  1
1  1  2

Notice that if we do the assignment directly (i.e. w/o the function), it works:

df = pd.DataFrame([[1,1],[1,1]])

df.loc[1,1] *= 2

print(df)

... gives:

   0  1
0  1  1
1  1  2

So, apparently I'm messing something up when passing that view through the function call. I have read this blog post from Jeff Knupp and I think I understand how Python's name-object binding works. My understanding of DataFrames is that when I call df.loc[1,1] it generates a proxy object which points to the original DataFrame w/ the [1,1] window, so that further operations (e.g. assignment) go only to elements inside the window. Now when I pass that df.loc[1,1] through the function call, the function binds the name df_view to the proxy object. So in my theory, any change (i.e. df_view *= a) should be applied to the view and thus to the elements in the original DataFrame. From the result, clearly that's not happening, and it seems the DataFrame is copied in the process (I'm not sure where) because some values were changed outside the original DataFrame.

Upvotes: 1

Views: 786

Answers (2)

miradulo

Reputation: 29710

Just check

>>> type(df.loc[1, 1])
numpy.int64

So evidently this isn't going to work - you're passing in a single immutable int, which has no binding to the outer DataFrame.
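To see why the scalar case cannot work, here is a minimal sketch in plain Python. The name cell is just a stand-in I made up for the value that df.loc[1, 1] returns; a plain int behaves the same way as numpy.int64 here, since both are immutable.

```python
def mult(value, a):
    # `value` is an immutable number here, so `*=` cannot change it in place.
    # Python computes value * a and rebinds the local name `value` only.
    value *= a
    return value

cell = 1                 # hypothetical stand-in for the scalar df.loc[1, 1]
result = mult(cell, 2)

print(cell)    # 1: the caller's object is unchanged
print(result)  # 2: only the function's local name saw the new value
```

The direct statement df.loc[1,1] *= 2 works because Python expands it into a read followed by a write back through df.loc, and that write back is exactly what is missing inside the function.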

Were you to pass in an actual view with simple indexing (mutable construct), it would most likely work.

>>> mult(df.loc[:, 1], 2)
>>> df
    0  1
0   1  2
1   1  2

But some other ops will not work; here the DataFrame is left unchanged:

>>> mult(df.loc[:, :1], 2)
>>> df
    0  1
0   1  2
1   1  2

All in all I think this control flow is a bad idea - a better option would be to operate directly on the index as you showed works. Pandas tends to be more friendly (IMHO) when you stick to immutability when possible.
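One way to read "stick to immutability" is to let the function return new values and do the assignment at the call site. This is only a sketch of that idea, not necessarily what you need: mult here no longer mutates anything, and the caller decides which part of the DataFrame is read and written.

```python
import pandas as pd

def mult(values, a):
    # Pure function: returns new values and never touches the DataFrame.
    return values * a

df = pd.DataFrame([[1, 1], [1, 1]])

# The caller chooses the window and writes the result back through .loc.
df.loc[1, 1] = mult(df.loc[1, 1], 2)

print(df)
```

The function still never sees the whole DataFrame, and the write goes through df.loc, so it does not depend on whether pandas handed back a view or a copy.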

Upvotes: 1

B. M.

Reputation: 18668

The problem is that in some cases, which are sometimes difficult to detect, copies of the data are made.

You can get around the difficulty by indexing inside the function:

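def mult(df, i, j, a):
    # Index inside the function so the write goes through df.loc on the original
    df.loc[i, j] *= a

mult(df, 1, 1, 2)             # scale the single cell at row 1, column 1
mult(df, 1, slice(0, 2), 6)   # scale row 1 for column labels 0 to 2 (.loc slices are inclusive)
print(df)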

which gives:

   0  1
0  1  1
1  6 12
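If you want the selection to be optional, the same idea can be sketched with default selectors. The keyword names rows and cols are my own choice for illustration, not anything pandas prescribes; slice(None) plays the role of ":" so that omitting a selector means "everything along that axis".

```python
import pandas as pd

def mult(df, a, rows=slice(None), cols=slice(None)):
    # Read and write both go through df.loc, so the original is updated.
    df.loc[rows, cols] *= a

df = pd.DataFrame([[1, 1], [1, 1]])

mult(df, 2, rows=1, cols=1)   # only the cell at row 1, column 1
mult(df, 3, cols=[0])         # every row of column 0

print(df)
```

The function does receive the whole DataFrame, but the caller still controls which part gets modified.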

Upvotes: 1
