Benjamin Lenglet
Benjamin Lenglet

Reputation: 53

Summing a transaction chain in a dataframe, rows linked by column values

I'm trying to link multiple rows from a DataFrame in order to get all possible paths formed by connecting receiver ids to sender ids.

Here is an example of my DataFrame:

   transaction_id sender_id receiver_id  amount
0          213234       002         125      10
1          223322       017         354      90
2          343443       125         689      70
3          324433       689         233       5
4          328909       354         456      10

created with:

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

Result of my code should be list of chained ids and the total amount for the transaction chain. For the first two rows in the above example, that's something like:

[('002', '125', '689', '233'), 85]
[('017', '354', '456'), 100]

I already tried to iterate through the rows and convert each row to an instance of a Node class, and then used methods for traversing a linked list, but I have no idea what the next step is here:

class Node:
    def __init__(self,transaction_id,sender,receiver,amount):
        self.transac = transaction_id
        self.val = sender_id
        self.next = receiver_id
        self.amount = amount
    def traverse(self):
        node = self # start from the head node
        while node != None:
            print (node.val) # access the node value
            node = node.next # move on to the next node

for index, row in customerTransactionSqlDf3.iterrows():
    index = Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    )

Additional information:

Upvotes: 5

Views: 219

Answers (2)

dee cue
dee cue

Reputation: 1043

I have no idea what the next step is here

By using your current implementation, you can connect the two Node objects by iterating each nodes. You can also add visited property in the Node class so that you can identify unique chain as you traverse through the tree i.e. there is not one chain that is a sub-chain of another chain. However, if you want to know the chain for each sender_id, this may be not necessary.

Edit: I noticed that you mentioned the example of the expected result is for the first two rows. This implies that each sender_id should have their own chain. Modifying the traverse method so that it can be used after the nodes are all connected.

Edit: Reimplementing visited property to get unique chain

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

class Node:
    def __init__(self,transaction_id,sender_id,receiver_id,amount):
        self.transac = transaction_id
        self.sender = sender_id
        self.receiver = receiver_id
        self.next = None
        self.amount = amount
        self.visited = False
    def traverse(self, chain=None, total=0):
        if (self.visited): # undo visited nodes
            return
        self.visited = True
        if chain is None: # this is the beginning of the traversal
            chain = [self.sender]
        chain += [self.receiver]
        total += self.amount
        if self.next is not None:
            return self.next.traverse(chain, total)
        return chain, total

transc = [Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    ) for i, row in df.iterrows()]

# connect the nodes
for i, v in enumerate(transc):
    for j, k in enumerate(transc):
        # if the receiver v same as the sender from j
        if v.receiver == k.sender:
            v.next = k


summary = [i.traverse() for i in transc]
summary = [i for i in summary if i is not None] # removing None

print(summary)

The output:

[
    (['002', '125', '689', '233'], 85), 
    (['017', '354', '456'], 100)
]

Upvotes: 1

Martijn Pieters
Martijn Pieters

Reputation: 1121406

You have a directed graph, with the edges formed by id -> id connections. You are trying to enumerate all paths through this graph. This is actually much easier by not using linked lists.

Note that your linked list implementation doesn't actually link the nodes; your next values would have to be referencing other Node instances, not an id.

Because your paths can't have loops, the graph is said to be acyclic. Your paths are really simple too, as you say that there is never more than one receiver id per sender id.

Create a new view into your dataframe with the sender_id as the index, together with the receiver id and amount columns; this will make it very easy to find the next path elements. You can then iterate over these columns and traverse their paths and sum their amounts, trivially. The following code uses already found paths to avoid having to traverse those paths again:

# receiver and amount rows, indexed by sender
edges = df[['sender_id', 'receiver_id', 'amount']].set_index('sender_id')
paths = {}   # sender -> [sender, receiver, receiver, receiver, ...]
totals = {}  # sender -> total amount

for sender, next_, amount in edges.itertuples():
    path = paths[sender] = [sender, next_]
    totals[sender] = amount
    while True:
        if next_ in paths:
            # re-use already found path
            path += paths[next_]
            totals[sender] += totals[next_]
            break

        try:
            next_, amount = edges.loc[next_]
        except KeyError:
            break  # path complete

        path.append(next_)
        totals[sender] += amount

The code could be made more efficient still by updating every sub-path encountered, so when you process the 3rd row for sender id 125, you already have that path processed because you had to traverse it for the path starting at002 for the first row:

for sender, next_, amount in edges.itertuples():
    if sender in paths:
        # already handled as part of a longer path
        continue

    paths[sender], totals[sender] = [sender, next_], amount
    senders = [sender]  # all sender ids along the path

    while True:
        if next_ in paths:
            # re-use already found path
            for sender in senders:
                paths[sender] += paths[next_]
                totals[sender] += totals[next_]
            break

        if next_ not in edges.index:
            break  # path complete

        # start a new path from this sender id
        paths[next_], totals[next_] = [next_], 0
        senders.append(next_)

        next_, amount = edges.loc[next_]
        for sender in senders:
            paths[sender].append(next_)
            totals[sender] += amount

Either way, you now have the full paths and totals worked out for all your transactions. You can turn those back into additional columns:

df['path'], df['total'] = df.sender_id.map(paths), df.sender_id.map(totals)

For your input dataframe, that produces:

   transaction_id sender_id receiver_id  amount                  path  total
0          213234       002         125      10  [002, 125, 689, 233]     85
1          223322       017         354      90       [017, 354, 456]    100
2          343443       125         689      70       [125, 689, 233]     75
3          324433       689         233       5            [689, 233]      5
4          328909       354         456      10            [354, 456]     10

Alternatively, you can pair up the paths and the totals by looping over either dictionary:

for id, path in paths.items():
    print(id, path, totals[id])

which, again for your specific example, produces:

002 ['002', '125', '689', '233'] 85
125 ['125', '689', '233'] 75
689 ['689', '233'] 5
017 ['017', '354', '456'] 100
354 ['354', '456'] 10

Upvotes: 2

Related Questions