Reputation: 5435
Here is a simple program that transitively maps sets of columns:
import pandas as pd
df1vals = [{'c1': '1', 'c2': "2"}]
df1 = pd.DataFrame(df1vals, columns = ['c1' , 'c2'])
df2vals = [{'c2': '2', 'c3': "100"}]
df2 = pd.DataFrame(df2vals, columns = ['c2' , 'c3'])
df3vals = [{'c3': '100', 'c4': "x"}]
df3 = pd.DataFrame(df3vals, columns = ['c3' , 'c4'])
df4vals = [{'c1': '1', 'c4': "m"}]
df4 = pd.DataFrame(df4vals, columns = ['c1' , 'c4'])
df5vals = [{'c2': '2', 'c4': "k"}]
df5 = pd.DataFrame(df5vals, columns = ['c2' , 'c4'])
dfs = [df1,df2, df3, df4, df5]
merged_df = dfs[0]
for df in dfs[1:]:
common_cols = list(set(merged_df.columns) & set(df.columns))
merged_df = pd.merge(merged_df, df, on=common_cols, how='outer')
display(merged_df)
This works to produce this output:
c1 c2 c3 c4
0 1 2 100 x
1 1 NaN NaN m
2 NaN 2 NaN k
This data is all good, but it's missing the fact that c1 1 is associated with c4 k through c2 2. So, I would also want this row:
1 2 NaN k
I would think that the same logic that maps 1 -> 2 -> 100 -> x would also work on this, but it does not. Why?
Upvotes: 0
Views: 79
Reputation: 262224
It looks to me like you need to compute a graph of the connected cells to find the connected components.
Let's use networkx
for that. Here all cells are connected together:
from itertools import combinations
import networkx as nx
tmp = pd.concat(dfs)
G = nx.from_edgelist([((a, c), (b, d))
for a, b in combinations(tmp, 2)
for c, d in zip(tmp[a], tmp[b])
if pd.notna(c) and pd.notna(d)
])
out = list(nx.connected_components(G))
Output:
[{('c4', 'k'), ('c4', 'x'), ('c1', '1'), ('c3', '100'), ('c4', 'm'), ('c2', '2')}]
Graph:
Now for the demo, let's change the value of c2
to 7
in the first dataframe. The output now shows two groups:
[{('c4', 'm'), ('c2', '7'), ('c1', '1')},
{('c4', 'x'), ('c4', 'k'), ('c2', '2'), ('c3', '100')}]
Graph:
Upvotes: 0
Reputation: 5435
I'm not completely sure, but I think my problem was that I was merging on None (e.g. all matching columns) when in fact I only wanted to merge if the left-most column matched. Therefore the right line is:
merged_df = pd.merge(merged_df, df, on=df.columns[0], how='outer')
Upvotes: 0