Jeremy
Jeremy

Reputation: 5435

Complete recursive transitive dependencies per row in pandas dataframe

Here is a simple program that transitively maps sets of columns:

import pandas as pd

df1vals = [{'c1': '1', 'c2': "2"}]
df1 = pd.DataFrame(df1vals, columns = ['c1' , 'c2'])

df2vals = [{'c2': '2', 'c3': "100"}]
df2 = pd.DataFrame(df2vals, columns = ['c2' , 'c3'])


df3vals = [{'c3': '100', 'c4': "x"}]
df3 = pd.DataFrame(df3vals, columns = ['c3' , 'c4'])

df4vals = [{'c1': '1', 'c4': "m"}]
df4 = pd.DataFrame(df4vals, columns = ['c1' , 'c4'])

df5vals = [{'c2': '2', 'c4': "k"}]
df5 = pd.DataFrame(df5vals, columns = ['c2' , 'c4'])

dfs = [df1,df2, df3, df4, df5]

merged_df = dfs[0]
for df in dfs[1:]:
    common_cols = list(set(merged_df.columns) & set(df.columns))
    merged_df = pd.merge(merged_df, df, on=common_cols, how='outer')  

display(merged_df)

This works to produce this output:

    c1  c2  c3  c4
0   1   2   100 x
1   1   NaN NaN m
2   NaN 2   NaN k

This data is all good, but it's missing the fact that c1 1 is associated with c4 k through c2 2. So, I would also want this row:

1 2 NaN k 

I would think that the same logic that maps 1 -> 2 -> 100 -> x would also work on this, but it does not. Why?

Upvotes: 0

Views: 79

Answers (2)

mozway
mozway

Reputation: 262224

It looks to me like you need to compute a graph of the connected cells to find the connected components.

Let's use networkx for that. Here all cells are connected together:

from itertools import combinations
import networkx as nx

tmp = pd.concat(dfs)

G = nx.from_edgelist([((a, c), (b, d))
 for a, b in combinations(tmp, 2)
 for c, d in zip(tmp[a], tmp[b])
 if pd.notna(c) and pd.notna(d)
 ])

out = list(nx.connected_components(G))

Output:

[{('c4', 'k'), ('c4', 'x'), ('c1', '1'), ('c3', '100'), ('c4', 'm'), ('c2', '2')}]

Graph:

enter image description here

Now for the demo, let's change the value of c2 to 7 in the first dataframe. The output now shows two groups:

[{('c4', 'm'), ('c2', '7'), ('c1', '1')},
 {('c4', 'x'), ('c4', 'k'), ('c2', '2'), ('c3', '100')}]

Graph:

enter image description here

Upvotes: 0

Jeremy
Jeremy

Reputation: 5435

I'm not completely sure, but I think my problem was that I was merging on None (e.g. all matching columns) when in fact I only wanted to merge if the left-most column matched. Therefore the right line is:

merged_df = pd.merge(merged_df, df, on=df.columns[0], how='outer')

Upvotes: 0

Related Questions