Reputation: 21
I am new to Python and I was wondering if somebody could help me with the task below.
I have the dataframe df below with the following columns:
Each primary entity (Entity type: A) might consist of some secondary entities (Entity types: X or Y). For entity type A, the primary id is the same as the secondary id. Also, each primary entity and each secondary entity have a value.
In the columns 'Sum of values Secondary id X' and 'Sum of values Secondary id Y', I want the aggregate value of the secondary entities (X and Y) that correspond to each primary entity. The aggregate values should appear in the row of the primary entity.
So, my initial df is this:
Primary ID | Secondary ID | Entity type | Value |
---|---|---|---|
0109 | 0109 | A | 200 |
0109 | A234 | X | 100 |
0109 | A234 | X | 50 |
9996 | 9996 | A | 400 |
9996 | AAGT | X | 120 |
9996 | AABG | X | 30 |
9996 | 0082 | Y | 50 |
A765 | A765 | A | 50 |
And I just want to add the two columns, without changing the format of the initial dataframe:
Primary ID | Secondary ID | Entity type | Value | Sum of values Secondary id X | Sum of values Secondary id Y |
---|---|---|---|---|---|
0109 | 0109 | A | 200 | 150 | 0 |
0109 | A234 | X | 100 | 0 | 0 |
0109 | A234 | X | 50 | 0 | 0 |
9996 | 9996 | A | 400 | 150 | 50 |
9996 | AAGT | X | 120 | 0 | 0 |
9996 | AABG | X | 30 | 0 | 0 |
9996 | 0082 | Y | 50 | 0 | 0 |
A765 | A765 | A | 50 | 0 | 0 |
Thank you!
Upvotes: 2
Views: 123
Reputation: 863751
First solution: replace the Value entries with missing values using Series.where, then build the new columns with GroupBy.transform, and finally set the duplicated Primary ID rows to 0 with DataFrame.mask:
cols = ['Sum of values Secondary id X', 'Sum of values Secondary id Y']
df[cols] = (df.assign(x=df['Value'].where(df['Entity type'].eq('X')),
                      y=df['Value'].where(df['Entity type'].eq('Y')))
              .groupby('Primary ID')[['x', 'y']]
              .transform('sum')
              .mask(df['Primary ID'].duplicated(), 0)
              .astype(int))
print (df)
Primary ID Secondary ID Entity type Value Sum of values Secondary id X \
0 0109 0109 A 200 150
1 0109 A234 X 100 0
2 0109 A234 X 50 0
3 9996 9996 A 400 150
4 9996 AAGT X 120 0
5 9996 AABG X 30 0
6 9996 0082 Y 50 0
7 A765 A765 A 50 0
Sum of values Secondary id Y
0 0
1 0
2 0
3 50
4 0
5 0
6 0
7 0
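The key step is Series.where: it keeps each value where the condition is True and inserts NaN elsewhere, so the subsequent sum only aggregates the matching entity type. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy data: one 'A' row and two 'X' rows, mirroring the question's shape.
value = pd.Series([200, 100, 50])
is_x = pd.Series(['A', 'X', 'X']).eq('X')

# Keep values only where the entity type is 'X'; others become NaN.
masked = value.where(is_x)
print(masked.tolist())  # [nan, 100.0, 50.0]
print(masked.sum())     # 150.0 -- NaN is skipped by sum
```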
If you need the sums for all entity types as new columns, use DataFrame.pivot_table with DataFrame.join:
df1 = (df.pivot_table(index='Primary ID',
                      columns='Entity type',
                      values='Value',
                      aggfunc='sum', fill_value=0)
         .add_prefix('Sum of values Secondary id '))
df = df.join(df1, on='Primary ID')
df.loc[df['Primary ID'].duplicated(), df1.columns] = 0
print (df)
Primary ID Secondary ID Entity type Value Sum of values Secondary id A \
0 0109 0109 A 200 200
1 0109 A234 X 100 0
2 0109 A234 X 50 0
3 9996 9996 A 400 400
4 9996 AAGT X 120 0
5 9996 AABG X 30 0
6 9996 0082 Y 50 0
7 A765 A765 A 50 50
Sum of values Secondary id X Sum of values Secondary id Y
0 150 0
1 0 0
2 0 0
3 150 50
4 0 0
5 0 0
6 0 0
7 0 0
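As a sanity check on the pivot step: pivot_table builds one row per Primary ID with one summed column per entity type, and join then broadcasts those sums onto every row sharing that ID (which is why duplicated rows are zeroed afterwards). A minimal sketch with made-up data:

```python
import pandas as pd

# Toy miniature of the question's data for one primary entity.
df = pd.DataFrame({'Primary ID': ['0109', '0109', '0109'],
                   'Entity type': ['A', 'X', 'X'],
                   'Value': [200, 100, 50]})

# One row per Primary ID, one summed column per entity type.
wide = df.pivot_table(index='Primary ID', columns='Entity type',
                      values='Value', aggfunc='sum', fill_value=0)

print(wide.loc['0109', 'A'])  # 200 (the primary row's own value)
print(wide.loc['0109', 'X'])  # 150 (100 + 50)
```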
Upvotes: 0
Reputation: 13349
Try creating a mapping using:
(df.groupby(['Primary ID', 'Entity type'])['Value']
   .sum()
   .unstack(-1)
   .fillna(0)
   .reset_index()
)
Entity type Primary ID A X Y
0 0109 200.0 150.0 0.0
1 9996 400.0 150.0 50.0
2 A765 50.0 0.0 0.0
The above data can be converted into a dict (or used directly as a lookup table) and then mapped back onto the original dataframe.
Complete solution:
map_df = (df.groupby(['Primary ID', 'Entity type'])['Value']
            .sum()
            .unstack(-1)
            .fillna(0)
            .reset_index()
            .drop(['A'], axis=1)
            .set_index('Primary ID'))
df.set_index('Primary ID', inplace=True)
df.loc[df['Entity type'].eq('A'), ['new_x', 'new_y']] = map_df.values
df:
Secondary ID | Entity type | Value | new_x | new_y | |
---|---|---|---|---|---|
Primary ID | |||||
0109 | 0109 | A | 200 | 150.0 | 0.0 |
0109 | A234 | X | 100 | 0.0 | 0.0 |
0109 | A234 | X | 50 | 0.0 | 0.0 |
9996 | 9996 | A | 400 | 150.0 | 50.0 |
9996 | AAGT | X | 120 | 0.0 | 0.0 |
9996 | AABG | X | 30 | 0.0 | 0.0 |
9996 | 0082 | Y | 50 | 0.0 | 0.0 |
A765 | A765 | A | 50 | 0.0 | 0.0 |
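Note that assigning map_df.values works only because the 'A' rows happen to appear in the same order as map_df's (sorted) index. A sketch, with made-up data, that aligns explicitly by Primary ID via Series.map instead of relying on row order:

```python
import pandas as pd

# Toy data: two primary entities with one secondary row each.
df = pd.DataFrame({'Primary ID': ['0109', '0109', '9996', '9996'],
                   'Entity type': ['A', 'X', 'A', 'Y'],
                   'Value': [200, 100, 400, 50]})

# Sum per (Primary ID, Entity type), spread entity types into columns.
map_df = (df.groupby(['Primary ID', 'Entity type'])['Value']
            .sum()
            .unstack(fill_value=0))

# Fill the new columns on 'A' rows by looking up the Primary ID,
# so the result is correct regardless of row order.
is_a = df['Entity type'].eq('A')
for col, new in [('X', 'new_x'), ('Y', 'new_y')]:
    df[new] = 0.0
    df.loc[is_a, new] = df.loc[is_a, 'Primary ID'].map(map_df[col])

print(df.loc[0, 'new_x'])  # 100.0
print(df.loc[2, 'new_y'])  # 50.0
```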
Upvotes: 1