Reputation: 458
This is not a duplicate of:
Fastest way to perform complex search on pandas dataframe
Note: pandas ver 0.23.4
Assumptions: data can be laid out in any order.
I have a list:
L = ['A', 'B', 'C', 'D', 'L', 'M', 'N', 'O']
I also have a dataframe. Col1 and Col2 have several associated columns that have related info I wish to keep. The information is arbitrary so I have not filled it in.
Col1 Col2 Col1Info Col2Info Col1moreInfo Col2moreInfo
A    B    x        x        x            x
B    C
D    C
L    M
M    N
N    O
I am trying to perform a 'search and group' for each element of the list. For example, if we searched for the list element 'D', the following group would be returned:
To From Col1Info Col2Info Col1moreInfo Col2moreInfo
A  B    x        x        x            x
B  C
D  C
I have been playing around with networkx, but it is a very complex package.
Upvotes: 1
Views: 632
Reputation: 88285
You could define a graph using the values from both columns as edges, and look for the connected_components. Here's a way using NetworkX:
import networkx as nx
G = nx.Graph()
G.add_edges_from(df.values.tolist())
cc = list(nx.connected_components(G))
# [{'A', 'B', 'C', 'D'}, {'L', 'M', 'N', 'O'}]
Now say, for instance, you want to filter by 'D'. You could then do:
component = next(i for i in cc if 'D' in i)
# {'A', 'B', 'C', 'D'}
And index the dataframe where the values from both columns are in component:
df[df.isin(component).all(1)]
Col1 Col2
0 A B
1 B C
2 D C
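If your dataframe also carries the extra info columns from the question, the same idea still applies. Here is a minimal sketch (assuming the two node columns are literally named Col1 and Col2) that builds the edges from just those two columns and filters on them only, so the associated info columns are kept on the matched rows:
import networkx as nx

# Build the graph from the two node columns only, then keep whole rows
# (info columns included) whose Col1 and Col2 both fall in the component.
G = nx.Graph()
G.add_edges_from(df[['Col1', 'Col2']].values.tolist())
cc = list(nx.connected_components(G))

component = next(i for i in cc if 'D' in i)
df[df[['Col1', 'Col2']].isin(component).all(axis=1)]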
The above can be extended to all items in the list by generating a list of dataframes. Then we simply have to index using the position in which a given item appears in L:
L = ['A', 'B', 'C', 'D', 'L', 'M', 'N', 'O']
dfs = [df[df.isin(i).all(1)] for j in L for i in cc if j in i]
print(dfs[L.index('D')])
Col1 Col2
0 A B
1 B C
2 D C
print(dfs[L.index('L')])
Col1 Col2
3 L M
4 M N
5 N O
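If L is long, a slightly cheaper variant (again just a sketch, reusing the cc list from above) is to map each node to its component index once and group on that, instead of re-testing membership for every item:
# Map every node to the index of its connected component, then group rows
# by the component of Col1 (both columns of a row belong to the same component).
node_to_comp = {node: idx for idx, comp in enumerate(cc) for node in comp}
groups = {idx: g for idx, g in df.groupby(df['Col1'].map(node_to_comp))}

print(groups[node_to_comp['D']])   # rows of the component containing 'D'
print(groups[node_to_comp['L']])   # rows of the component containing 'L'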
Upvotes: 3