chippycentra
chippycentra

Reputation: 3432

Merge rows with duplications in pandas

Hello I have a dataframe such as :

   species family     Events       groups
1     SP1      A      10,22           G1
2     SP1      B          7           G2
3     SP1    C,D  4,5,6,1,3  G3,G4,G5,G6
4     SP2      A      22,10           G1
5     SP2    D,C  6,5,4,3,1  G4,G6,G5,G3
6     SP3      C  4,5,3,6,1     G3,G6,G5
7     SP3      E          7           G2
8     SP3      A         10           G1
9     SP4      C       7,22          G12

and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species).

For instance I will merge the rows :

species family    Events      groups
SP1      A        10,22       G1
species family    Events      groups
SP2      A        22,10       G1
species family    Events      groups
SP3      A        10          G1

into

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1

SO if I do the same for each row I should get an expected output as :

species      family    Events      groups
SP1,SP2,SP3  A         10,22       G1
SP1,SP3      B,E       7           G2
SP1,SP2,SP3  C,D       1,3,4,5,6   G3,G4,G6,G5
SP4          C         7,22        G12 

Note that SP4 has not been merged with any rows since its group was not present in any other rows.

Does someone have an idea please ? Thank you very much for your help and time

Here is the dataframe in dic format it can helps:

{'species': {1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP2', 5: 'SP2', 6: 'SP3', 7: 'SP3', 8: 'SP3', 9: 'SP4'}, 'family': {1: 'A', 2: 'B', 3: 'C,D', 4: 'A', 5: 'D,C', 6: 'C', 7: 'E', 8: 'A', 9: 'C'}, 'Events': {1: '10,22', 2: '7', 3: '4,5,6,1,3', 4: '22,10', 5: '6,5,4,3,1', 6: '4,5,3,6,1', 7: '7', 8: '10', 9: '7,22'}, 'groups': {1: 'G1', 2: 'G2', 3: 'G3,G4,G5,G6', 4: 'G1', 5: 'G4,G6,G5,G3', 6: 'G3,G6,G5', 7: 'G2', 8: 'G1', 9: 'G12'}}

Upvotes: 4

Views: 86

Answers (1)

Serge Ballesta
Serge Ballesta

Reputation: 149175

The trick here is to split the cells on comma, handles them as sets, and finaly join them back. Those are indeed non vectorizable operations, but it leads to (rather) simple code using apply.

I would first compute a containing group for each row:

g = df['groups'].apply(lambda x: set(x.split(',')))   # explode into sets
# keep the larger set from g containing the current one and make it back a string
g2 = g.apply(lambda s: ','.join(sorted(
    g[g.apply(lambda x: x.issuperset(s))].max())))

With the sample, it gives:

1             G1
2             G2
3    G3,G4,G5,G6
4             G1
5    G3,G4,G5,G6
6    G3,G4,G5,G6
7             G2
8             G1
9            G12
Name: groups, dtype: object

We can now use it to groupby the dataframe and aggregate the groups using the set trick:

resul = df[['species', 'family', 'Events']].groupby(g2).agg(
    lambda x: ','.join(sorted(set((i for j in x for i in j.split(',')))))
    ).reset_index().reindex(df.columns)

And we get as expected:

       species family     Events       groups
0  SP1,SP2,SP3      A      10,22           G1
1          SP4      C       22,7          G12
2      SP1,SP3    B,E          7           G2
3  SP1,SP2,SP3    C,D  1,3,4,5,6  G3,G4,G5,G6

Upvotes: 1

Related Questions