Reputation: 3432
Hello I have a dataframe such as :
species family Events groups
1 SP1 A 10,22 G1
2 SP1 B 7 G2
3 SP1 C,D 4,5,6,1,3 G3,G4,G5,G6
4 SP2 A 22,10 G1
5 SP2 D,C 6,5,4,3,1 G4,G6,G5,G3
6 SP3 C 4,5,3,6,1 G3,G6,G5
7 SP3 E 7 G2
8 SP3 A 10 G1
9 SP4 C 7,22 G12
and I would like to simply merge row for each where there is at least one duplicated element in each columns (except species
).
For instance I will merge the rows :
species family Events groups
SP1 A 10,22 G1
species family Events groups
SP2 A 22,10 G1
species family Events groups
SP3 A 10 G1
into
species family Events groups
SP1,SP2,SP3 A 10,22 G1
SO if I do the same for each row I should get an expected output as :
species family Events groups
SP1,SP2,SP3 A 10,22 G1
SP1,SP3 B,E 7 G2
SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G6,G5
SP4 C 7,22 G12
Note that SP4 has not been merged with any rows since its group was not present in any other rows.
Does someone have an idea please ? Thank you very much for your help and time
Here is the dataframe in dic format it can helps:
{'species': {1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP2', 5: 'SP2', 6: 'SP3', 7: 'SP3', 8: 'SP3', 9: 'SP4'}, 'family': {1: 'A', 2: 'B', 3: 'C,D', 4: 'A', 5: 'D,C', 6: 'C', 7: 'E', 8: 'A', 9: 'C'}, 'Events': {1: '10,22', 2: '7', 3: '4,5,6,1,3', 4: '22,10', 5: '6,5,4,3,1', 6: '4,5,3,6,1', 7: '7', 8: '10', 9: '7,22'}, 'groups': {1: 'G1', 2: 'G2', 3: 'G3,G4,G5,G6', 4: 'G1', 5: 'G4,G6,G5,G3', 6: 'G3,G6,G5', 7: 'G2', 8: 'G1', 9: 'G12'}}
Upvotes: 4
Views: 86
Reputation: 149175
The trick here is to split the cells on comma, handles them as sets, and finaly join them back. Those are indeed non vectorizable operations, but it leads to (rather) simple code using apply
.
I would first compute a containing group for each row:
g = df['groups'].apply(lambda x: set(x.split(','))) # explode into sets
# keep the larger set from g containing the current one and make it back a string
g2 = g.apply(lambda s: ','.join(sorted(
g[g.apply(lambda x: x.issuperset(s))].max())))
With the sample, it gives:
1 G1
2 G2
3 G3,G4,G5,G6
4 G1
5 G3,G4,G5,G6
6 G3,G4,G5,G6
7 G2
8 G1
9 G12
Name: groups, dtype: object
We can now use it to groupby
the dataframe and aggregate the groups using the set trick:
resul = df[['species', 'family', 'Events']].groupby(g2).agg(
lambda x: ','.join(sorted(set((i for j in x for i in j.split(',')))))
).reset_index().reindex(df.columns)
And we get as expected:
species family Events groups
0 SP1,SP2,SP3 A 10,22 G1
1 SP4 C 22,7 G12
2 SP1,SP3 B,E 7 G2
3 SP1,SP2,SP3 C,D 1,3,4,5,6 G3,G4,G5,G6
Upvotes: 1