Derek Eden
Derek Eden

Reputation: 4638

how to pandas groupby using two columns but merge groups for unique combinations of keys in those two columns

I'm struggling to figure this out without writing some hideous loop to go over and check the groups after the groupby and merge them together. I feel like their has to be a way to do this that I just don't know.

What I'm trying to do is group a dataframe by two of its columns, but sometimes there will be combinations of those groups where the keys are flipped (i.e. [key1, key2] will have a group and [key2, key1] will have a group. I actually want to merge the groups from these combinations into a single group.

It is possible to do in a loop after the fact. I have also tried using some methods like this:

unique combinations of values in selected columns in pandas data frame and count

but could not get it to work.

here is a sample of my df:

            Ves-1 type          Ves-2 type    Duration
0                cargo                 tug  898.559993
1     fishing_trawling                 tug  898.559992
2   fishing_transiting                 tug  898.559993
3   fishing_transiting                 tug  898.559993
4                  tug                 tug  898.559992
5                cargo                 tug  898.560002
6                cargo                 tug  898.560002
7            passenger                 tug  907.200008
8             pleasure                 tug  898.560003
9                cargo                 tug  898.559993
10               cargo                 tug  898.559993
11               cargo  fishing_transiting  898.560002
12               cargo  fishing_transiting  898.559993
13               cargo  fishing_transiting  898.560002
14                 tug  fishing_transiting  898.560003
15               cargo  fishing_transiting  907.200008
16               cargo  fishing_transiting  907.200008
17                 tug  fishing_transiting  898.560002
18               cargo  fishing_transiting  898.560002
19  fishing_transiting  fishing_transiting  898.559993

if I just do a simple groupby using both Ves columns:

>>> test.groupby(['Ves-1 type','Ves-2 type'])['Duration'].agg(list)
Ves-1 type          Ves-2 type
cargo               fishing_transiting    [898.560002, 898.5599930000001, 898.560002, 90...
                    tug                   [898.5599930000001, 898.560002, 898.560002, 89...
fishing_transiting  fishing_transiting                                  [898.5599930000001]
                    tug                              [898.5599930000001, 898.5599930000001]
fishing_trawling    tug                                                 [898.5599920000001]
passenger           tug                                                        [907.200008]
pleasure            tug                                                        [898.560003]
tug                 fishing_transiting                             [898.560003, 898.560002]
                    tug                                                 [898.5599920000001]

the problem is now I have a fishing_transiting/tug combo and a tug/fishing_transiting combo... is there any way to have these groupes merged together?

EDIT - another workaround I tried that works, but would like to know if there's a way to just handle this in groupby:

>>> test['key'] = list(zip(test['Ves-1 type'].values, test['Ves-2 type'].values))
>>> test['key'] = test['key'].apply(sorted).astype(str)
>>> test.groupby('key')['Duration'].agg(list)
key
['cargo', 'fishing_transiting']                 [898.560002, 898.5599930000001, 898.560002, 90...
['cargo', 'tug']                                [898.5599930000001, 898.560002, 898.560002, 89...
['fishing_transiting', 'fishing_transiting']                                  [898.5599930000001]
['fishing_transiting', 'tug']                   [898.5599930000001, 898.5599930000001, 898.560...
['fishing_trawling', 'tug']                                                   [898.5599920000001]
['passenger', 'tug']                                                                 [907.200008]
['pleasure', 'tug']                                                                  [898.560003]
['tug', 'tug']                                                                [898.5599920000001]

Upvotes: 2

Views: 1450

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71687

Let's sort the values in the columns Ves-1 type and Ves-2 type along axis=1 then groupby the dataframe on these sorted columns and agg Duration using list:

c = ['Ves-1 type', 'Ves-2 type']
df.groupby(np.sort(df[c], axis=1).T.tolist())['Duration'].agg(list)

cargo               fishing_transiting    [898.5600019999999, 898.559993, 898.5600019999...
                    tug                   [898.559993, 898.5600019999999, 898.5600019999...
fishing_transiting  fishing_transiting                                         [898.559993]
                    tug                   [898.559993, 898.559993, 898.5600029999999, 89...
fishing_trawling    tug                                                        [898.559992]
passenger           tug                                                        [907.200008]
pleasure            tug                                                 [898.5600029999999]
tug                 tug                                                        [898.559992]
Name: Duration, dtype: object

Upvotes: 4

Related Questions