mitsi
mitsi

Reputation: 1165

Group a pandas dataframe by one column OR another one

This is my dataframe :

df = pd.DataFrame({'name' : ['name1', 'name2', 'name1', 'name3'],
                    'rate' : [1,2,2,3],
                    'id' : range(4)})

   id   name  rate
0   0  name1     1
1   1  name2     2
2   2  name1     2
3   3  name3     3

I want to group the rows of a pandas dataframe if they have the same values in column name OR in column rate.

          id            name       rate
0  [0, 1, 2]  [name1, name2]  [1, 2, 2]
1        [3]           name3        [3]

I have a huge dataframe so I don't want to iterate over each row (unless that the only solution). What should I do ?

(I can use Numpy arrays instead of Pandas dataframe)

Upvotes: 3

Views: 279

Answers (1)

Ami Tavory
Ami Tavory

Reputation: 76346

Your conditions are unboundedly transitive. Say in rows 2i, 2i + 1 the name is shared and in rows 2i + 1, 2i + 2 the rate is shared, you need to keep linking rows.

One way to solve this is using the graph theory's connected components algorithm.

enter image description here

For this you can use networkx. In code, it could be as follows:

import networkx as nx
import itertools

G = nx.Graph()
G.add_nodes_from(df.id)
G.add_edges_from(
    [(r1[1]['id'], r2[1]['id']) for (r1, r2) in itertools.product(df.iterrows(), df.iterrows()) if r1[1].id < r2[1].id and (r1[1]['rate'] == r2[1]['rate'] or r1[1]['name'] == r2[1]['name'])]
)

Let's create a group column, indicating, for each row, its group:

df['group'] = df['id'].map(
    dict(itertools.chain.from_iterable([[(ee, i) for ee in e] for (i, e) in enumerate(nx.connected_components(G))])))
>>> df.group
0    0
1    0
2    0
3    1

Now you just need to groupby the group column, and apply a list.

Upvotes: 3

Related Questions