James Wright

Reputation: 1425

Append DataFrame inside Function

I have a function test that takes a DataFrame and appends data to it. I want the global variable placed into the function to be changed. I have the script below:

import pandas as pd
global dff

def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)

dff = pd.DataFrame()
test(dff)

After this, dff remains empty; it was not edited. However, if you do this:

import pandas as pd

def test(df):
    df['asdf'] = [1,2,3]
    return(df)

dff = pd.DataFrame()
test(dff)

dff will have [1,2,3] under the column 'asdf'. Notice that I didn't even have to declare the variable as global.

Why is this happening?

I actually would like to know, because the second I think I understand variable workspaces, I'm proven wrong and I'm getting sick and tired of constantly running into this BS*

I know the solution to the problem is:

import pandas as pd

def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)

dff = pd.DataFrame()
dff = test(dff)

but I'm really just trying to figure out why the initial method isn't working, especially in light of the second script I've shown.

*obviously it's not complete BS, but I can't understand it after 3 years of casual programming

Upvotes: 3

Views: 13701

Answers (2)

James Wright

Reputation: 1425

Update:

I found a very nice talk at PyCon 2015 that explains what I'm attempting to explain below, but with diagrams that make it significantly clearer. I'll leave the explanation below to explain how the original 3 scripts work, but I'd suggest going to watch the video:

Ned Batchelder - Facts and Myths about Python names and values - PyCon 2015


So, I think I've figured out what is happening in the scripts above. I'll try to break it down. Feel free to correct me if need be.

Few rules:

  1. Variables are names of links/pointers to an underlying object that actually holds the data. For example, street addresses. A street address is not a house; it simply points to a house. So the address (101 Streetway Rd.) is the pointer. In a GPS, you might have it labeled as "Home". The word "Home" would be the variable itself.

  2. Functions work on objects, not variables or pointers. When you pass a variable to a function, you are actually passing the object, not the variable or pointer. Continuing the house example, if you want to add a deck to a house, you want the decking contractors to work on the house, not the metaphysical address.

  3. The return command in a function returns a pointer to an object. So this would be the address of the house, not the house or the name you might call your house.

  4. = acts like a function meaning 'point to this object'. The variable to the left of the = is the output; the expression to the right is the input. This would be the act of naming a house. So Home = 101 Streetway Rd. makes the variable Home point to the house at 101 Streetway Rd. Let's say you moved into your neighbor's house, which is 102 Streetway Rd. This could be done by Home = Neighbor's House. Home now points to the house at 102 Streetway Rd.
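These rules can be checked with `is`, which tests whether two names point to the same object. This is a minimal sketch that uses a plain list as a stand-in for the house; the names `home`, `neighbor`, and `add_deck` are made up for illustration.

```python
def add_deck(house):
    # Rule 2: the function receives the object itself, so this in-place
    # change is visible through every name that points to it.
    house.append("deck")
    # Rule 3: return hands back a reference to that same object.
    return house

home = ["kitchen"]               # Rule 1: home ---> a list object
neighbor = ["pool"]

result = add_deck(home)
same_object = result is home     # no copy was made

old_home = home
home = neighbor                  # Rule 4: rebinding only moves the name

print(same_object)   # True
print(old_home)      # ['kitchen', 'deck']
print(home)          # ['pool']
```

Rebinding `home` did nothing to the original list; it is still reachable, unchanged, through `old_home`.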

Here on out, I'll use ---> to mean "points to"

Before we get to the Scripts let's start with what we want. We want the object objdff pointed to by a varia

Script 1:

(without the global dff as that doesn't do anything relevant)

import pandas as pd

def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)

dff = pd.DataFrame()
test(dff)

So let's walk through the script. Nothing interesting happens until we get to:

dff = pd.DataFrame()

Here, we have the variable dff being assigned to the object created by pd.DataFrame(), which is an empty DataFrame. We'll call this object objdff. So at the end of this line, we have dff ---> objdff.

Next line: test(dff)

Functions work on objects, so we're saying that we're going to run the function test on the object that dff points to, which is objdff. This brings us to the function itself.

def test(df):

Here, we have what is essentially an = function. The object passed to test, objdff, is now pointed to by the function's parameter df. So now df ---> objdff and dff ---> objdff.

Moving onto the next line: df = df.append(...)

Let's start with df.append(...). The .append(...) call is passed on to objdff. This makes the object objdff run a method called 'append'. As pointed out by @Jai, the .append(...) method uses a return command to output an entirely new DataFrame that has the data appended to it. We'll call the new object objdff_apnd.

Now we can move onto the df = ... part. What we have now is essentially df = objdff_apnd. This is pretty simple now. The variable df now points to the object objdff_apnd.

At the end of this line we have df ---> objdff_apnd and dff ---> objdff. This is where the problem lies. The object we want (objdff_apnd) is not being pointed to by dff.

So at the end, the variable dff is still pointing to objdff, not to objdff_apnd. This brings us to Script 3 (see below).

Script 2:

import pandas as pd

def test(df):
    df['asdf'] = [1,2,3]
    return(df)

dff = pd.DataFrame()
test(dff)

Just like Script 1, dff ---> objdff. During test(dff), the function variable df ---> objdff. This is where things are different.

The assignment df['asdf'] = [1,2,3] is, again, sent to the underlying object objdff. Last time, this resulted in a new object. This time, however, the ['asdf'] assignment directly edits objdff, so objdff now has the extra 'asdf' column in it.

Therefore at the end we have df ---> objdff and dff ---> objdff. So they point to the same object, which means the variable dff points to the edited object.

Once we break outside of the function, variable dff still points to objdff, which has the new data in it. This gives us the desired result.

Script 3:

import pandas as pd

def test(df):
    df = df.append({'asdf':1, 'sdf':2}, ignore_index=True)
    return(df)

dff = pd.DataFrame()
dff = test(dff)

This script is identical to Script 1, except for the line dff = test(dff). We'll get to that in a second.

Continuing from the end of Script 1, we left off right as the function test(dff) was ending, and we have dff ---> objdff and df ---> objdff_apnd.

The function test has the return command, and so returns the object objdff_apnd. This turns the line dff = test(dff) into dff = objdff_apnd.

Therefore at the end, we have dff ---> objdff_apnd, which is exactly the result we want.
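To contrast the two ways of getting the appended data into `dff`, here is a rough sketch, again assuming a recent pandas with `pd.concat` in place of the removed `DataFrame.append`. The helper names `append_new` and `append_in_place`, and the starting row of zeros, are hypothetical. The second helper uses `.loc` with a new row label, which enlarges the existing object instead of creating a new one.

```python
import pandas as pd

def append_new(df):
    row = pd.DataFrame([{'asdf': 1, 'sdf': 2}])
    return pd.concat([df, row], ignore_index=True)   # returns a new object

def append_in_place(df):
    # Setting a row at a label that does not exist yet enlarges the
    # existing object; values are given in column order (asdf, sdf).
    df.loc[len(df)] = [1, 2]

dff = pd.DataFrame({'asdf': [0], 'sdf': [0]})
original = dff
dff = append_new(dff)            # rebinding at the call site, as in Script 3
rebinding_made_new_object = dff is not original

other = pd.DataFrame({'asdf': [0], 'sdf': [0]})
append_in_place(other)           # no assignment needed at the call site

print(len(dff), len(other), rebinding_made_new_object)
```

Returning the new object and rebinding, as in Script 3, is usually easier to follow than relying on in-place side effects.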

Upvotes: 4

Jai

Reputation: 3310

  • I think pandas DataFrames, lists and dictionaries are all passed to a function by reference (the function receives the same object, not a copy), hence this behavior.
  • In the first script, append returns a whole new object with the row appended, so the original DataFrame is not filled.
  • In the second script, you are assigning a column on the original DataFrame object, which modifies that object, so the original DataFrame gets the column.
  • You can check out this answer: python pandas dataframe, is it pass-by-value or pass-by-reference
  • Check this list example:

    def test1(a):
        a.append(1)
    
    def test2(a):
        a = [1, 2, 3]
    
    def test3(a):
        a[0] = 10
    
    aa = list()
    test1(aa)
    print(aa)
    
    aa = list()
    test2(aa)
    print(aa)
    
    aa = list([1])
    test3(aa)
    print(aa)
    
  • Output:

    [1]
    []
    [10]
    
  • Relate the list example above to the pandas DataFrame example.
  • If you check the append function of DataFrame:
    DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None): Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
  • As you can see in the description, append returns a new object.
  • The way you are using the global keyword is wrong. I think that even if you did not have global in the first script, it would not make any difference. I do not know the details of the global keyword, so I will not say more about it. But I know how to use the keyword, and that is definitely not the right way to use it.
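For reference, `global` only has an effect inside a function body, where it makes assignments to that name rebind the module-level variable; at module level, as in the first script, it does nothing. This is a minimal sketch with a plain list standing in for the DataFrame, and the names `data` and `replace_data` are illustrative.

```python
data = []

def replace_data():
    global data            # assignments to data now target the module-level name
    data = data + [1]      # builds a new list and rebinds the global to it

replace_data()
print(data)   # [1]
```

Without the `global data` line, the assignment would make `data` a local name and the function would raise UnboundLocalError when it tried to read `data` on the right-hand side.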

Upvotes: 2
