Fernando S. Peregrino

Reputation: 515

Apply a function to two dataframe rows

Given a pandas dataframe like this:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

col1    col2
0   1   4
1   2   5
2   3   6

I would like to do something equivalent to the following using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, which would cause a memory error):

i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
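    # Add the previous (already updated) row to the current row, column by column
    c1 = df.loc[i].iloc[0] + df.loc[index].iloc[0]
    c2 = df.loc[i].iloc[1] + df.loc[index].iloc[1]
    # .ix was removed in pandas 1.0; use label-based .loc instead
    df.loc[index, df.columns[0]] = c1
    df.loc[index, df.columns[1]] = c2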
    i = index

col1    col2
0   1   4
1   3   9
2   6   15

That is, I would like a function that produces the output above, along these lines:

def my_function(two_rows):
   row1 = two_rows[0]
   row2 = two_rows[1]
   c1 = row1[0] + row2[0]
   c2 = row1[1] + row2[1]
   row2[0] = c1
   row2[1] = c2
   return row2

df.apply(my_function, axis=1)
df

col1    col2
0   1   4
1   3   9
2   6   15

Is there a way of doing this?

Upvotes: 1

Views: 647

Answers (1)

piRSquared

Reputation: 294258

What you've demonstrated is a cumulative sum, which pandas provides as cumsum:

df.cumsum()

   col1  col2
0     1     4
1     3     9
2     6    15
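
As a quick check, the sketch below rebuilds the example frame from the question, computes the running sum with an explicit row loop, and compares the result with df.cumsum(). The helper name running_sum_loop is made up for this illustration and is not part of pandas.

```python
import pandas as pd

def running_sum_loop(frame):
    """Hypothetical helper: add each row to the accumulated row above it."""
    out = frame.copy()  # work on a copy so the original stays untouched
    for pos in range(1, len(out)):
        # Row `pos` becomes itself plus the already-accumulated previous row.
        out.iloc[pos] = (out.iloc[pos] + out.iloc[pos - 1]).to_numpy()
    return out

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
looped = running_sum_loop(df)
summed = df.cumsum()

# Compare the raw values of both results
print(looped.to_numpy().tolist() == summed.to_numpy().tolist())
```

The loop is only there to show the equivalence; cumsum does the same work in vectorized form.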

To do this in place, define a function that loops over the data:

Slow, cell by cell

def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]

    return df

f(df)

   col1  col2
0     1     4
1     3     9
2     6    15

Compromise between memory and efficiency

def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()

    return df

f(df)

   col1  col2
0     1     4
1     3     9
2     6    15

Note that you don't need to return df. I chose to for convenience.
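
On the memory concern from the question: Python hands the function a reference to the DataFrame, so calling a function on it does not copy the data. Here is a minimal sketch of a variant with no return value. It assumes plain column assignment rather than writing through .values, since that array may be read-only when pandas' Copy-on-Write mode is enabled. The name cumsum_inplace is hypothetical.

```python
import pandas as pd

def cumsum_inplace(frame):
    """Hypothetical helper: overwrite each column with its running sum."""
    for col in frame.columns:
        # Only one column-sized temporary exists at any time.
        frame[col] = frame[col].cumsum()
    # No return statement: the caller's object has already been modified.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
result = cumsum_inplace(df)  # result is None
print(df)
```

After the call, df itself holds the running sums, which is why returning it is optional.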

Upvotes: 1
