Mobeus Zoom
Mobeus Zoom

Reputation: 608

Generate all combinations in pandas

I have a dataframe like:

df = pd.DataFrame([['A', 3, 'fox'], ['A', 3, 'cat'], ['A', 3, 'dog'],
                   ['B', 2, 'rabbit'], ['B', 2, 'dog'], ['B', 2, 'eel'],
                   ['C', 6, 'fox'], ['C', 6, 'elephant']],
                  columns=['group', 'val', 'animal'])
df

Output:

    group   val animal
0   A       3   fox
1   A       3   cat
2   A       3   dog
3   B       2   rabbit
4   B       2   dog
5   B       2   eel
6   C       6   fox
7   C       6   elephant

For a given group, the val is always the same (so always 3 for A, 2 for B, 6 for C).

How can I generate a dataframe with all combinations of group and animal elements? Also val should be carried over and there should be a column indicating whether this row was present in the original data or added in the permutations.

Desired result:

df = pd.DataFrame([['A', 3, 'fox', 1], ['A', 3, 'cat', 1], ['A', 3, 'dog', 1], ['A', 3, 'rabbit', 0], ['A', 3, 'eel', 0], ['A', 3, 'elephant', 0],
                   ['B', 2, 'rabbit', 1], ['B', 2, 'dog', 1], ['B', 2, 'eel', 1], ['B', 2, 'fox', 0], ['B', 2, 'cat', 0], ['B', 2, 'elephant', 0],
                   ['C', 6, 'fox', 1], ['C', 6, 'elephant', 1], ['C', 6, 'cat', 0], ['C', 6, 'dog', 0], ['C', 6, 'rabbit', 0], ['C', 6, 'eel', 0]], 
                  columns=['group', 'val', 'animal', 'occurred'])
df

Output:

    group   val animal  occurred
0   A       3   fox         1
1   A       3   cat         1
2   A       3   dog         1
3   A       3   rabbit      0
4   A       3   eel         0
5   A       3   elephant    0
6   B       2   rabbit      1
7   B       2   dog         1
8   B       2   eel         1
9   B       2   fox         0
10  B       2   cat         0
11  B       2   elephant    0
12  C       6   fox         1
13  C       6   elephant    1
14  C       6   cat         0
15  C       6   dog         0
16  C       6   rabbit      0
17  C       6   eel         0

How can I do this?

Edit: there have been a few answers which work. I'll give 'best' to whichever can be made to handle the possibility of multiple columns joined to group (e.g. not just 'val' but ['val1','val2']) in an elegant way.

Upvotes: 4

Views: 2507

Answers (5)

Carson
Carson

Reputation: 7958

Update, now it can handle single or multiple val.

I insert one data (['A', 3, 'fox']), and it still works.

import pandas as pd

df_single = pd.DataFrame([['A', 3, 'fox'], ['A', 3, 'cat'], ['A', 3, 'dog'],
                          ['A', 3, 'fox'],  # <-- new data
                          ['B', 2, 'rabbit'], ['B', 2, 'dog'], ['B', 2, 'eel'],
                          ['C', 6, 'fox'], ['C', 6, 'elephant'],
                          ],
                         columns=['group', 'val', 'animal'])

df_multiple = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4, 'dog'],
                            ['A', 3, 4, 'fox'],  # <-- new data  #  Others solution will occur ValueError("cannot handle a non-unique multi-index!")
                            ['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3, 'eel'],
                            ['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant'],
                            ],
                           columns=['group', 'val', 'val2', 'animal'])


def solution(df, val_col_name_list: list):
    animal_set = df.animal.unique()
    data_set = set(zip(*[df[col] for col in (['group'] + val_col_name_list)]))

    df_occurred = df.groupby(['group', 'animal']).aggregate(
        occurred=pd.NamedAgg(column=val_col_name_list[0], aggfunc='count')
    )
    df_occurred.reset_index(inplace=True)

    df_output = pd.DataFrame(columns=df.columns.to_list() + ['occurred'])

    for cur_animal in animal_set:
        for cur_group, *val in data_set:
            s: pd.Series = df_occurred[(df_occurred.group == cur_group) & (df_occurred.animal == cur_animal)]
            occurred: int = s.occurred.values[0] if len(s) else 0
            df_output.loc[len(df_output)] = [cur_group] + val + [cur_animal, occurred]  # insert data to last row
    print(df_output.sort_values(['group', 'occurred'], ascending=[True, False]).reset_index(drop=True))


if __name__ == '__main__':
    solution(df_single, ['val'])
    solution(df_multiple, ['val', 'val2'])

   group val    animal occurred
0      A   3       fox        2
1      A   3       cat        1
2      A   3       dog        1
3      A   3    rabbit        0
4      A   3       eel        0
5      A   3  elephant        0
6      B   2       dog        1
7      B   2    rabbit        1
8      B   2       eel        1
9      B   2       fox        0
10     B   2       cat        0
11     B   2  elephant        0
12     C   6       fox        1
13     C   6  elephant        1
14     C   6       cat        0
15     C   6       dog        0
16     C   6    rabbit        0
17     C   6       eel        0

   group val val2    animal occurred
0      A   3    4       fox        2
1      A   3    4       cat        1
2      A   3    4       dog        1
3      A   3    4    rabbit        0
4      A   3    4       eel        0
5      A   3    4  elephant        0
6      B   2    3       dog        1
7      B   2    3    rabbit        1
8      B   2    3       eel        1
9      B   2    3       fox        0
10     B   2    3       cat        0
11     B   2    3  elephant        0
12     C   6    7       fox        1
13     C   6    7  elephant        1
14     C   6    7       cat        0
15     C   6    7       dog        0
16     C   6    7    rabbit        0
17     C   6    7       eel        0

Upvotes: 1

Shaido
Shaido

Reputation: 28322

One solution would be to use reindex with MultiIndex:

mux = pd.MultiIndex.from_product([df['group'].unique(), df['animal'].unique()], names=('group','animal'))
df = df.set_index(['group','animal']).reindex(mux).reset_index()
df['occurred'] = df['val'].notnull().astype(int)
df['val'] = df.groupby('group')['val'].transform('first')

Result:

   group    animal  val  occurred
0      A       fox  3.0         1
1      A       cat  3.0         1
2      A       dog  3.0         1
3      A    rabbit  3.0         0
4      A       eel  3.0         0
5      A  elephant  3.0         0
6      B       fox  2.0         0
7      B       cat  2.0         0
8      B       dog  2.0         1
9      B    rabbit  2.0         1
10     B       eel  2.0         1
11     B  elephant  2.0         0
12     C       fox  6.0         1
13     C       cat  6.0         0
14     C       dog  6.0         0
15     C    rabbit  6.0         0
16     C       eel  6.0         0
17     C  elephant  6.0         1

Extension:

To handle multiple val columns, use a list of column names instead of only 'val'. Only the last two rows are changed slightly.

val_cols = ['val1', 'val2']

mux = pd.MultiIndex.from_product([df['group'].unique(), df['animal'].unique()], names=('group','animal'))
df = df.set_index(['group','animal']).reindex(mux).reset_index()
df['occurred'] = df[val_cols[0]].notnull().astype(int)
df[val_cols ] = df.groupby('group')[val_cols].transform('first')

Upvotes: 4

Scott Boston
Scott Boston

Reputation: 153460

Updated for multi values:

df = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4,'dog'],
                   ['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3,'eel'],
                   ['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant']],
                  columns=['group', 'val1', 'val2', 'animal'])


dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi[['val1', 'val2']]  = dfi.groupby(level=0)[['val1','val2']].transform('max')
print(dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False]))

Output:

   group    animal  val1  val2  occurred
0      A       cat     3     4         1
1      A       dog     3     4         1
4      A       fox     3     4         1
2      A       eel     3     4         0
3      A  elephant     3     4         0
5      A    rabbit     3     4         0
7      B       dog     2     3         1
8      B       eel     2     3         1
11     B    rabbit     2     3         1
6      B       cat     2     3         0
9      B  elephant     2     3         0
10     B       fox     2     3         0
15     C  elephant     6     7         1
16     C       fox     6     7         1
12     C       cat     6     7         0
13     C       dog     6     7         0
14     C       eel     6     7         0
17     C    rabbit     6     7         0

IIUC, you can do it like this assign 'Observed', using set_index, create multiindex, then groupby to fill NaN.

dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi['val']  = dfi.groupby(level=0)['val'].transform('max')
dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False])

Output:

   group    animal  val  occurred
0      A       fox    3         1
1      A       cat    3         1
2      A       dog    3         1
3      A    rabbit    3         0
4      A       eel    3         0
5      A  elephant    3         0
8      B       dog    2         1
9      B    rabbit    2         1
10     B       eel    2         1
6      B       fox    2         0
7      B       cat    2         0
11     B  elephant    2         0
12     C       fox    6         1
17     C  elephant    6         1
13     C       cat    6         0
14     C       dog    6         0
15     C    rabbit    6         0
16     C       eel    6         0

Upvotes: 2

Space Impact
Space Impact

Reputation: 13255

Try:

df2 = pd.DataFrame(list(product(df.group.unique(), df.animal.unique())), columns=['group', 'animal'])
df2['val'] = df2['group'].map(df.set_index('group')['val'].to_dict())
df2.merge(df.drop('val', axis=1).assign(occurred=1), how='outer').fillna(0, downcast='infer')

Upvotes: 1

Naman Patel
Naman Patel

Reputation: 11

You want to make a pivot table.

This is done in Pandas with the pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) command.

As you can see there are many arguments that pandas.pivot_table() takes, but there are some main ones. In your case, you are looking for pandas.pivot_table(df, columns=["group", "animal"], aggfunc = custom).

custom is a function you would have to write above the execution of the pivot_table() line to get the functionality you described: "val should be carried over and there should be a column indicating whether this row was present in the original data or added in the permutations".

Upvotes: 0

Related Questions