Group a pandas dataframe by one column OR another one

Question

This is my dataframe :

df = pd.DataFrame({'name' : ['name1', 'name2', 'name1', 'name3'],
                    'rate' : [1,2,2,3],
                    'id' : range(4)})

   id   name  rate
0   0  name1     1
1   1  name2     2
2   2  name1     2
3   3  name3     3

I want to group the rows of a pandas dataframe if they have the same values in column name OR in column rate.

          id            name       rate
0  [0, 1, 2]  [name1, name2]  [1, 2, 2]
1        [3]           name3        [3]

I have a huge dataframe so I don't want to iterate over each row (unless that the only solution). What should I do ?

(I can use Numpy arrays instead of Pandas dataframe)

Ami Tavory · Accepted Answer

Your conditions are unboundedly transitive. Say in rows 2i, 2i + 1 the name is shared and in rows 2i + 1, 2i + 2 the rate is shared, you need to keep linking rows.

One way to solve this is using the graph theory's connected components algorithm.

For this you can use networkx. In code, it could be as follows:

import networkx as nx
import itertools

G = nx.Graph()
G.add_nodes_from(df.id)
G.add_edges_from(
    [(r1[1]['id'], r2[1]['id']) for (r1, r2) in itertools.product(df.iterrows(), df.iterrows()) if r1[1].id < r2[1].id and (r1[1]['rate'] == r2[1]['rate'] or r1[1]['name'] == r2[1]['name'])]
)

Let's create a group column, indicating, for each row, its group:

df['group'] = df['id'].map(
    dict(itertools.chain.from_iterable([[(ee, i) for ee in e] for (i, e) in enumerate(nx.connected_components(G))])))
>>> df.group
0    0
1    0
2    0
3    1

Now you just need to groupby the group column, and apply a list.

Group a pandas dataframe by one column OR another one

Answers (1)

Related Questions