Reputation: 1425
I have a function test that takes a DataFrame and appends data to it. I want the global variable passed into the function to be changed. I have the script below:
import pandas as pd
global dff
def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)
dff = pd.DataFrame()
test(dff)
After this, dff remains empty; it was not edited. However, if you do this:
import pandas as pd
def test(df):
    df['asdf'] = [1,2,3]
    return(df)
dff = pd.DataFrame()
test(dff)
dff will have [1,2,3] under the column 'asdf'. Notice that I didn't even have to declare the variable as global.
Why is this happening?
I actually would like to know, because the second I think I understand variable workspaces, I'm proven wrong and I'm getting sick and tired of constantly running into this BS*
I know the solution to the problem is:
import pandas as pd
def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)
dff = pd.DataFrame()
dff = test(dff)
but I'm really just trying to figure out why the initial method isn't working, especially in light of the second script I've shown.
*obviously it's not complete BS, but I can't understand it after 3 years of casual programming
Upvotes: 3
Views: 13701
Reputation: 1425
I found a very nice talk at PyCon 2015 that explains what I'm attempting to explain below, but with diagrams that make it significantly clearer. I'll leave the explanation below to explain how the original 3 scripts work, but I'd suggest going to watch the video:
Ned Batchelder - Facts and Myths about Python names and values - PyCon 2015
So, I think I've figured out what is happening in the scripts above. I'll try to break it down. Feel free to correct me if need be.
A few rules:
Variables are names of links/pointers to an underlying object that actually holds the data. Think of street addresses. A street address is not a house; it simply points to a house. So the address (101 Streetway Rd.) is the pointer. In a GPS, you might have it labeled as "Home". The word "Home" would be the variable itself.
Functions work on objects, not variables or pointers. When you pass a variable to a function, you are actually passing the object, not the variable or pointer. Continuing the house example, if you want to add a deck to a house, you want the decking contractors to work on the house, not the metaphysical address.
The return command in a function returns a pointer to an object. So this would be the address of the house, not the house or the name you might call your house.
= acts like a function meaning 'point to this object'. The variable to the left of the = is the output; the variable to the right is the input. This would be the act of naming a house. So Home = 101 Streetway Rd. makes the variable Home point to the house on 101 Streetway Rd. Let's say you moved into your neighbor's house, which is 102 Streetway Rd. This could be done by Home = Neighbor's House. Now Home is the name of the pointer 102 Streetway Rd.
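As a rough illustration of these rules, here is a minimal sketch in which plain lists stand in for houses. The names house_101, house_102 and home are made up for this example and do not come from the scripts in the question.

```python
# Lists stand in for houses; every name here is illustrative only.
house_101 = ["kitchen", "bedroom"]          # an object with a name pointing to it
home = house_101                            # home ---> the same object as house_101

print(home is house_101)                    # True: two names, one object

house_102 = ["kitchen", "bedroom", "deck"]  # a different object
home = house_102                            # = re-points the name; nothing is copied

print(home is house_101)                    # False: home now points elsewhere
print(house_101)                            # ['kitchen', 'bedroom'], untouched
```

The is operator checks whether two names point to the same object, which is the question the rest of this explanation keeps asking.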
From here on out, I'll use ---> to mean "points to".
Before we get to the scripts, let's start with what we want. We want the variable dff to end up pointing to an object that contains the appended data.
Script 1 (without the global dff, as that doesn't do anything relevant):
import pandas as pd
def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)
dff = pd.DataFrame()
test(dff)
So let's walk through the script. Nothing interesting happens until we get to:
dff = pd.DataFrame()
Here, we have the variable dff being assigned to the object created by pd.DataFrame(), which is an empty DataFrame. We'll call this object objdff. So at the end of this line, we have dff ---> objdff.
Next line: test(dff)
Functions work on objects, so we're saying that we're going to run the function test on the object that dff points to, which is objdff. This brings us to the function itself.
def test(df):
Here, we have what is essentially an = function. The object passed to the test function, objdff, is pointed to by the function variable df. So now df ---> objdff and dff ---> objdff.
Moving on to the next line: df = df.append(...)
Let's start with df.append(...). The .append(...) call is passed on to objdff. This makes the object objdff run a method called 'append'. As pointed out by @Jai, the .append(...) method uses a return command to output an entirely new DataFrame that has the data appended to it. We'll call the new object objdff_apnd.
Now we can move on to the df = ... part. What we have now is essentially df = objdff_apnd. This is pretty simple now. The variable df now points to the object objdff_apnd.
At the end of this line we have df ---> objdff_apnd and dff ---> objdff. This is where the problem lies. The object we want (objdff_apnd) is not being pointed to by dff.
So at the end, the variable dff is still pointing to objdff, not to objdff_apnd. This brings us to Script 3 (see below).
Script 2:
import pandas as pd
def test(df):
    df['asdf'] = [1,2,3]
    return(df)
dff = pd.DataFrame()
test(dff)
Just like Script 1, dff ---> objdff. During test(dff), the function variable df ---> objdff. This is where things are different.
The operation df['asdf'] = [1,2,3] (an item assignment, not a plain = on a name) is again sent to the underlying object objdff. Last time, this resulted in a new object. This time, however, the ['asdf'] operation directly edits the object objdff. So the object objdff has the extra 'asdf' column in it.
Therefore at the end we have df ---> objdff and dff ---> objdff. So they point to the same object, which means the variable dff points to the edited object.
Once we break outside of the function, the variable dff still points to objdff, which has the new data in it. This gives us the desired result.
Script 3:
import pandas as pd
def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)
dff = pd.DataFrame()
dff = test(dff)
This script is identical to Script 1, except for the dff = test(dff). We'll get to that in a second.
Continuing from the end of Script 1, we left off right as the function test(dff) was ending, and we have dff ---> objdff and df ---> objdff_apnd.
The function test has the return command, and so returns the object objdff_apnd. This turns the line dff = test(dff) into dff = objdff_apnd.
Therefore at the end, we have dff ---> objdff_apnd, which is exactly the result we want.
Upvotes: 4
Reputation: 3310
append returns a new object, and hence it did not fill the original DataFrame. Check this list example:
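def test1(a):
    a.append(1)      # list.append edits the list in place

def test2(a):
    a = [1, 2, 3]    # rebinds the local name only; the caller's list is untouched

def test3(a):
    a[0] = 10        # item assignment edits the list in place

aa = list()
test1(aa)
print(aa)

aa = list()
test2(aa)
print(aa)

aa = list([1])
test3(aa)
print(aa)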
Output:
[1]
[]
[10]
The append function of DataFrame:
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
append returns a new object.
The global keyword is used incorrectly. I think that even without global in the first script, it won't make any difference. I do not know the details of the global keyword, so I will not say more about it, but I know how to use the keyword, and that is definitely not the right way to use it.
Upvotes: 2