Reputation: 13
I have a dataset of email pairs, where each row has a score indicating how similar the two emails are.
     emlgroup1  emlgroup2  scores
79    1739.eml   1742.eml     100
130   1742.eml   1739.eml     100
153   1743.eml   1744.eml      99
157   1743.eml   1748.eml      82
170   1744.eml   1743.eml      99
175   1744.eml   1748.eml      82
231   1747.eml   1750.eml      85
242   1748.eml   1743.eml      82
243   1748.eml   1744.eml      82
282   1750.eml   1747.eml      85
Now I want to group them automatically, as shown below, and put the result in a new DataFrame with a single column.
group 1: 1739.eml, 1742.eml
group 2: 1743.eml, 1744.eml, 1748.eml
group 3: 1747.eml, 1750.eml
Desired Output:
Col 1
1 1739.eml 1742.eml
2 1743.eml 1744.eml 1748.eml
3 1747.eml 1750.eml
I am getting stuck on the logic that splits the data into separate groups/clusters. I'm new to posting on Stack Overflow, so I hope I'm not committing any sins. Thanks in advance!
Upvotes: 1
Views: 35
Reputation: 323236
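This is a network (graph) problem: treat each email as a node and each row as an edge, then find the connected components using networkx.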
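import pandas as pd
import networkx as nx
# Each row becomes an undirected edge between two emails
G = nx.from_pandas_edgelist(df, 'emlgroup1', 'emlgroup2')
# Each connected component is one group of similar emails
l = list(nx.connected_components(G))
l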
[{'1739.eml', '1742.eml'}, {'1744.eml', '1743.eml', '1748.eml'}, {'1747.eml', '1750.eml'}]
pd.Series(l).to_frame('col 1')
col 1
0 {1739.eml, 1742.eml}
1 {1744.eml, 1743.eml, 1748.eml}
2 {1747.eml, 1750.eml}
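To get closer to the desired output in the question (space-separated names, rows numbered from 1), the same idea can be sketched end to end as below. The DataFrame is rebuilt by hand from the sample rows in the question, and the names `groups` and `out` are only illustrative. Sorting is an added assumption to make the order deterministic, since sets are unordered.

```python
import networkx as nx
import pandas as pd

# Sample rows copied from the question (original index labels omitted).
df = pd.DataFrame({
    "emlgroup1": ["1739.eml", "1742.eml", "1743.eml", "1743.eml", "1744.eml",
                  "1744.eml", "1747.eml", "1748.eml", "1748.eml", "1750.eml"],
    "emlgroup2": ["1742.eml", "1739.eml", "1744.eml", "1748.eml", "1743.eml",
                  "1748.eml", "1750.eml", "1743.eml", "1744.eml", "1747.eml"],
    "scores": [100, 100, 99, 82, 99, 82, 85, 82, 82, 85],
})

# Build an undirected graph: one node per email, one edge per row.
G = nx.from_pandas_edgelist(df, "emlgroup1", "emlgroup2")

# Sort the members of each component, then sort the components themselves,
# so the result does not depend on set iteration order.
groups = sorted(sorted(component) for component in nx.connected_components(G))

# One row per group, names joined by spaces, index starting at 1.
out = pd.DataFrame(
    {"Col 1": [" ".join(group) for group in groups]},
    index=range(1, len(groups) + 1),
)
print(out)
```

Note that this ignores the `scores` column entirely: any pair listed in the data counts as linked. If only pairs above some similarity should be grouped, one option is to filter `df` on `scores` before building the graph; the right cutoff depends on the data.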
Upvotes: 2