Reputation: 608
I have a dataframe like:
df = pd.DataFrame([['A', 3, 'fox'], ['A', 3, 'cat'], ['A', 3, 'dog'],
['B', 2, 'rabbit'], ['B', 2, 'dog'], ['B', 2, 'eel'],
['C', 6, 'fox'], ['C', 6, 'elephant']],
columns=['group', 'val', 'animal'])
df
Output:
group val animal
0 A 3 fox
1 A 3 cat
2 A 3 dog
3 B 2 rabbit
4 B 2 dog
5 B 2 eel
6 C 6 fox
7 C 6 elephant
For a given group, the val
is always the same (so always 3 for A
, 2 for B
, 6 for C
).
How can I generate a dataframe with all combinations of group
and animal
elements? Also val
should be carried over and there should be a column indicating whether this row was present in the original data or added in the permutations.
Desired result:
df = pd.DataFrame([['A', 3, 'fox', 1], ['A', 3, 'cat', 1], ['A', 3, 'dog', 1], ['A', 3, 'rabbit', 0], ['A', 3, 'eel', 0], ['A', 3, 'elephant', 0],
['B', 2, 'rabbit', 1], ['B', 2, 'dog', 1], ['B', 2, 'eel', 1], ['B', 2, 'fox', 0], ['B', 2, 'cat', 0], ['B', 2, 'elephant', 0],
['C', 6, 'fox', 1], ['C', 6, 'elephant', 1], ['C', 6, 'cat', 0], ['C', 6, 'dog', 0], ['C', 6, 'rabbit', 0], ['C', 6, 'eel', 0]],
columns=['group', 'val', 'animal', 'occurred'])
df
Output:
group val animal occurred
0 A 3 fox 1
1 A 3 cat 1
2 A 3 dog 1
3 A 3 rabbit 0
4 A 3 eel 0
5 A 3 elephant 0
6 B 2 rabbit 1
7 B 2 dog 1
8 B 2 eel 1
9 B 2 fox 0
10 B 2 cat 0
11 B 2 elephant 0
12 C 6 fox 1
13 C 6 elephant 1
14 C 6 cat 0
15 C 6 dog 0
16 C 6 rabbit 0
17 C 6 eel 0
How can I do this?
Edit: there have been a few answers which work. I'll give 'best' to whichever can be made to handle the possibility of multiple columns joined to group
(e.g. not just 'val'
but ['val1','val2']
) in an elegant way.
Upvotes: 4
Views: 2507
Reputation: 7958
Update, now it can handle single or multiple val
.
I insert one data (['A', 3, 'fox']
), and it still works.
import pandas as pd
df_single = pd.DataFrame([['A', 3, 'fox'], ['A', 3, 'cat'], ['A', 3, 'dog'],
['A', 3, 'fox'], # <-- new data
['B', 2, 'rabbit'], ['B', 2, 'dog'], ['B', 2, 'eel'],
['C', 6, 'fox'], ['C', 6, 'elephant'],
],
columns=['group', 'val', 'animal'])
df_multiple = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4, 'dog'],
['A', 3, 4, 'fox'], # <-- new data # Others solution will occur ValueError("cannot handle a non-unique multi-index!")
['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3, 'eel'],
['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant'],
],
columns=['group', 'val', 'val2', 'animal'])
def solution(df, val_col_name_list: list):
animal_set = df.animal.unique()
data_set = set(zip(*[df[col] for col in (['group'] + val_col_name_list)]))
df_occurred = df.groupby(['group', 'animal']).aggregate(
occurred=pd.NamedAgg(column=val_col_name_list[0], aggfunc='count')
)
df_occurred.reset_index(inplace=True)
df_output = pd.DataFrame(columns=df.columns.to_list() + ['occurred'])
for cur_animal in animal_set:
for cur_group, *val in data_set:
s: pd.Series = df_occurred[(df_occurred.group == cur_group) & (df_occurred.animal == cur_animal)]
occurred: int = s.occurred.values[0] if len(s) else 0
df_output.loc[len(df_output)] = [cur_group] + val + [cur_animal, occurred] # insert data to last row
print(df_output.sort_values(['group', 'occurred'], ascending=[True, False]).reset_index(drop=True))
if __name__ == '__main__':
solution(df_single, ['val'])
solution(df_multiple, ['val', 'val2'])
group val animal occurred
0 A 3 fox 2
1 A 3 cat 1
2 A 3 dog 1
3 A 3 rabbit 0
4 A 3 eel 0
5 A 3 elephant 0
6 B 2 dog 1
7 B 2 rabbit 1
8 B 2 eel 1
9 B 2 fox 0
10 B 2 cat 0
11 B 2 elephant 0
12 C 6 fox 1
13 C 6 elephant 1
14 C 6 cat 0
15 C 6 dog 0
16 C 6 rabbit 0
17 C 6 eel 0
group val val2 animal occurred
0 A 3 4 fox 2
1 A 3 4 cat 1
2 A 3 4 dog 1
3 A 3 4 rabbit 0
4 A 3 4 eel 0
5 A 3 4 elephant 0
6 B 2 3 dog 1
7 B 2 3 rabbit 1
8 B 2 3 eel 1
9 B 2 3 fox 0
10 B 2 3 cat 0
11 B 2 3 elephant 0
12 C 6 7 fox 1
13 C 6 7 elephant 1
14 C 6 7 cat 0
15 C 6 7 dog 0
16 C 6 7 rabbit 0
17 C 6 7 eel 0
Upvotes: 1
Reputation: 28322
One solution would be to use reindex
with MultiIndex
:
mux = pd.MultiIndex.from_product([df['group'].unique(), df['animal'].unique()], names=('group','animal'))
df = df.set_index(['group','animal']).reindex(mux).reset_index()
df['occurred'] = df['val'].notnull().astype(int)
df['val'] = df.groupby('group')['val'].transform('first')
Result:
group animal val occurred
0 A fox 3.0 1
1 A cat 3.0 1
2 A dog 3.0 1
3 A rabbit 3.0 0
4 A eel 3.0 0
5 A elephant 3.0 0
6 B fox 2.0 0
7 B cat 2.0 0
8 B dog 2.0 1
9 B rabbit 2.0 1
10 B eel 2.0 1
11 B elephant 2.0 0
12 C fox 6.0 1
13 C cat 6.0 0
14 C dog 6.0 0
15 C rabbit 6.0 0
16 C eel 6.0 0
17 C elephant 6.0 1
Extension:
To handle multiple val columns, use a list of column names instead of only 'val'
. Only the last two rows are changed slightly.
val_cols = ['val1', 'val2']
mux = pd.MultiIndex.from_product([df['group'].unique(), df['animal'].unique()], names=('group','animal'))
df = df.set_index(['group','animal']).reindex(mux).reset_index()
df['occurred'] = df[val_cols[0]].notnull().astype(int)
df[val_cols ] = df.groupby('group')[val_cols].transform('first')
Upvotes: 4
Reputation: 153460
df = pd.DataFrame([['A', 3, 4, 'fox'], ['A', 3, 4, 'cat'], ['A', 3, 4,'dog'],
['B', 2, 3, 'rabbit'], ['B', 2, 3, 'dog'], ['B', 2, 3,'eel'],
['C', 6, 7, 'fox'], ['C', 6, 7, 'elephant']],
columns=['group', 'val1', 'val2', 'animal'])
dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi[['val1', 'val2']] = dfi.groupby(level=0)[['val1','val2']].transform('max')
print(dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False]))
Output:
group animal val1 val2 occurred
0 A cat 3 4 1
1 A dog 3 4 1
4 A fox 3 4 1
2 A eel 3 4 0
3 A elephant 3 4 0
5 A rabbit 3 4 0
7 B dog 2 3 1
8 B eel 2 3 1
11 B rabbit 2 3 1
6 B cat 2 3 0
9 B elephant 2 3 0
10 B fox 2 3 0
15 C elephant 6 7 1
16 C fox 6 7 1
12 C cat 6 7 0
13 C dog 6 7 0
14 C eel 6 7 0
17 C rabbit 6 7 0
IIUC, you can do it like this assign 'Observed', using set_index, create multiindex, then groupby to fill NaN.
dfi = df.set_index(['group', 'animal']).assign(occurred=1)
indx = pd.MultiIndex.from_product(dfi.index.levels)
dfi = dfi.reindex(indx, fill_value=0)
dfi['val'] = dfi.groupby(level=0)['val'].transform('max')
dfi.reset_index().sort_values(['group', 'occurred'], ascending=[True, False])
Output:
group animal val occurred
0 A fox 3 1
1 A cat 3 1
2 A dog 3 1
3 A rabbit 3 0
4 A eel 3 0
5 A elephant 3 0
8 B dog 2 1
9 B rabbit 2 1
10 B eel 2 1
6 B fox 2 0
7 B cat 2 0
11 B elephant 2 0
12 C fox 6 1
17 C elephant 6 1
13 C cat 6 0
14 C dog 6 0
15 C rabbit 6 0
16 C eel 6 0
Upvotes: 2
Reputation: 13255
Try:
df2 = pd.DataFrame(list(product(df.group.unique(), df.animal.unique())), columns=['group', 'animal'])
df2['val'] = df2['group'].map(df.set_index('group')['val'].to_dict())
df2.merge(df.drop('val', axis=1).assign(occurred=1), how='outer').fillna(0, downcast='infer')
Upvotes: 1
Reputation: 11
You want to make a pivot table.
This is done in Pandas with the pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) command.
As you can see there are many arguments that pandas.pivot_table() takes, but there are some main ones. In your case, you are looking for pandas.pivot_table(df, columns=["group", "animal"], aggfunc = custom).
custom is a function you would have to write above the execution of the pivot_table() line to get the functionality you described: "val should be carried over and there should be a column indicating whether this row was present in the original data or added in the permutations".
Upvotes: 0