Comparing results in dataframe and grouping results

Question

I have a dataset consisting of emails and how they are similar to each other correlated by their score.

    emlgroup1 emlgroup2  scores
79   1739.eml  1742.eml     100
130  1742.eml  1739.eml     100
153  1743.eml  1744.eml      99
157  1743.eml  1748.eml      82
170  1744.eml  1743.eml      99
175  1744.eml  1748.eml      82
231  1747.eml  1750.eml      85
242  1748.eml  1743.eml      82
243  1748.eml  1744.eml      82
282  1750.eml  1747.eml      85

What I want to do now is group them automatically like so and put that in a new dataframe with one column.

group 1: 1739.eml, 1742.eml

group 2: 1743.eml, 1744.eml, 1748

group 3: 1747.eml, 1750.eml

Desired Output:

         Col 1
1  1739.eml 1742.eml
2  1743.eml 1744.eml 1748.eml
3  1747.eml 1750.eml

I am getting stuck at the logic part where it splits the data into another group/cluster. I'm really new to posting on StackOverflow so I hope I am not committing any sins, Thanks in advance!

BENY · Accepted Answer

This network problem using networkx

import networkx as nx 
G=nx.from_pandas_edgelist(df, 'emlgroup1', 'emlgroup2')
l=list(nx.connected_components(G))
l
[{'1739.eml', '1742.eml'}, {'1744.eml', '1743.eml', '1748.eml'}, {'1747.eml', '1750.eml'}]

pd.Series(l).to_frame('col 1')
                            col 1
0            {1739.eml, 1742.eml}
1  {1744.eml, 1743.eml, 1748.eml}
2            {1747.eml, 1750.eml}

Comparing results in dataframe and grouping results

Answers (1)

Related Questions