mstaal

Reputation: 630

How to do recursive vectorized calculations in pandas DataFrame via columns?

Edit: I have altered the sample data so that the 5th row is now gone due to an error in the sample.

Assume that we have a directed graph G = (V, E) with vertices V and edges E. Assume that we have a pandas DataFrame describing which nodes (u, v) are connected to each other and the weight of the corresponding edge e. The following is a representation of such a DataFrame.

#   from   to   weight
-----------------------
0     0     1     1.0
1     1     2     0.5
2     2     3     0.2     
3     0     4     1.3
4     4     5     0.9  

Is it possible to add a column with accumulated weights, so that, for instance, row 2 has an accumulated weight of 1.7 = 0.2 + 0.5 + 1.0 because there is a path 0->1->2->3? Preferably in a vectorized way, so that the calculation scales. In other words, we should get the following DataFrame.

#   from   to   weight    accumulated
-------------------------------------
0     0     1     1.0      1.0
1     1     2     0.5      1.5
2     2     3     0.2      1.7   
3     0     4     1.3      1.3
4     4     5     0.9      2.2

We can assume that there is no other path to vertex 3 since the DataFrame is made such that only shortest paths are included.

I have thus far written the following piece of code that uses DataFrame.apply, which is not a vectorized approach. Here I store / cache previously calculated accumulated values in a dictionary called accum_map.

def __set_accum(self, row):
    # Return the cached value if this target node was already resolved.
    search = row["to"]
    if search in self.accum_map:
        return self.accum_map[search]
    from_node = row["from"]
    # Look up the edge leading into from_node (at most one, since only
    # shortest paths are stored in the DataFrame).
    parent = self.df[self.df["to"] == from_node]
    if parent.empty:
        # from_node is a root, so the path starts with this edge.
        accumulated = row["weight"]
    else:
        accumulated = self.__set_accum(parent.iloc[0]) + row["weight"]
    self.accum_map[search] = accumulated
    return accumulated

def set_accumulated(self):
    self.df["accumulated"] = self.df.apply(func=self.__set_accum, axis=1)
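For comparison with the apply-based version above, the same accumulation can be sketched as a level-by-level propagation, where each pass resolves every row whose parent node already has a known distance. This is only a sketch, not part of the original code: the names `edges`, `dist`, and `acc` are hypothetical, and it assumes every node has at most one incoming edge (as stated for the shortest-path DataFrame). It loops once per path depth rather than once per row, so it is only partly vectorized.

```python
import pandas as pd

# Sample data from the question.
edges = pd.DataFrame({
    "from": [0, 1, 2, 0, 4],
    "to": [1, 2, 3, 4, 5],
    "weight": [1.0, 0.5, 0.2, 1.3, 0.9],
})

# Known distance per node; roots (nodes that never appear in "to") start at 0.
roots = set(edges["from"]) - set(edges["to"])
dist = pd.Series(0.0, index=sorted(roots))

acc = pd.Series(float("nan"), index=edges.index)
while acc.isna().any():
    # Distance of each row's source node; NaN if not resolved yet.
    base = edges["from"].map(dist)
    ready = acc.isna() & base.notna()
    if not ready.any():
        break  # remaining rows are not reachable from any root
    acc[ready] = base[ready] + edges.loc[ready, "weight"]
    # Record the distances of the newly reached target nodes.
    new = pd.Series(acc[ready].to_numpy(),
                    index=edges.loc[ready, "to"].to_numpy())
    dist = pd.concat([dist, new])

edges["accumulated"] = acc
print(edges)
```

Because `Series.map` needs a unique index on `dist`, the sketch breaks if a node is reached by more than one edge; in that case the duplicates would have to be reduced first (for example by keeping the minimum).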

Upvotes: 0

Views: 108

Answers (1)

PTQuoc

Reputation: 1083

There seems to be an issue with your dataset, so I have created another example here:

import pandas as pd
import numpy as np

df = pd.DataFrame({'from':[0,2,3,1,0],
                   'to':[1,2,4,4,2],
                   'val':[10,20,30,40,50]})

If you do not need a groupby, it is easier to extract the Series first, sum over index slices, and then add the result back to the main DataFrame.

# Extract value:
s = df['val']
st = df['from']
gt = (df['to']+1)

# Sum `val` over the positional slice from `from` to `to` (inclusive) for each row
out = list()
for i,j in zip(st,gt):
    out.append(s.iloc[i:j].sum())

# Add `new` col from result
df['new'] = out
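The loop above can also be written without iterating over rows by using a prefix sum: the sum of `val` over positions `i` to `j-1` equals `prefix[j] - prefix[i]`. The following is a minimal sketch of that idea, not part of the original answer; the names `prefix`, `start`, and `stop` are hypothetical, and it assumes every `from` and `to + 1` is a valid position (between 0 and `len(df)`) with `from <= to + 1`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'from': [0, 2, 3, 1, 0],
                   'to': [1, 2, 4, 4, 2],
                   'val': [10, 20, 30, 40, 50]})

# prefix[k] holds the sum of the first k values of `val` (prefix[0] == 0).
prefix = np.concatenate(([0], df['val'].to_numpy().cumsum()))

start = df['from'].to_numpy()
stop = df['to'].to_numpy() + 1  # exclusive end, same as `gt` in the loop

# Slice sum for every row at once.
df['new'] = prefix[stop] - prefix[start]
print(df)
```

Unlike slicing, which silently clamps out-of-range bounds, the fancy indexing here raises an `IndexError` if `to + 1` exceeds the number of rows.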

Upvotes: 0
