Reputation: 1503
For a DataFrame:
>>> df = DataFrame([['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.10'],
...                 ['2021-03-31', 'A1019', '990CT', 'MARK', '0.10'],
...                 ['2021-03-31', 'A0019', '990RT', 'MARK', '100'],
...                 ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.70'],
...                 ['2021-03-31', 'A0029', '990CT', 'OFFSET', '1.10'],
...                 ['2021-03-31', 'A0029', '990CT', 'MARK', '9.10'],
...                 ['2021-03-31', 'A0019', '990QT', 'MARK', '99.10'],
...                 ['2021-03-31', 'C0019', '990QT', 'OFFSET', '1'],
...                 ['2021-03-31', 'C0019', '990QT', 'GHTC', '5'],
...                 ['2021-03-31', 'C0019', '990QT', 'OFFSET', '15']],
...                columns=['DATE', 'A_ID', 'R_ID', 'TYPE', 'I_VAL'])
>>> df
DATE A_ID R_ID TYPE I_VAL
0 2021-03-31 A0019 990RT OFFSET 0.10
1 2021-03-31 A1019 990CT MARK 0.10
2 2021-03-31 A0019 990RT MARK 100
3 2021-03-31 A0019 990RT OFFSET 0.70
4 2021-03-31 A0029 990CT OFFSET 1.10
5 2021-03-31 A0029 990CT MARK 9.10
6 2021-03-31 A0019 990QT MARK 99.10
7 2021-03-31 C0019 990QT OFFSET 1
8 2021-03-31 C0019 990QT GHTC 5
9 2021-03-31 C0019 990QT OFFSET 15
Each NON OFFSET (e.g. MARK, GHTC) row uniquely matches ZERO or more OFFSET rows based on the combination of DATE, A_ID, R_ID. That is, there is a ONE-TO-MANY relation between NON OFFSET (e.g. MARK) rows and OFFSET rows.
I need to complete an operation in two steps:
1. Sum the I_VAL values of all rows whose DATE, A_ID, R_ID are the same. Put the aggregated value as the value of I_VAL in the NON OFFSET row.
2. Remove the rows with TYPE OFFSET.
The resultant DataFrame is:
# The rows with TYPE OFFSET are removed from resulting df.
# Keeping the OFFSET rows for explaining aggregation
# 0, 1, 2, 3, etc. are the indexes (row number) of the rows
DATE A_ID R_ID TYPE I_VAL
0 2021-03-31 A0019 990RT OFFSET 0.10
1 2021-03-31 A1019 990CT MARK 0.10 # no update, condition not met
2 2021-03-31 A0019 990RT MARK 100.80 # updated with sum of 0, self, and 3
3 2021-03-31 A0019 990RT OFFSET 0.70
4 2021-03-31 A0029 990CT OFFSET 1.10
5 2021-03-31 A0029 990CT MARK 10.20 # updated with sum of own value and 4
6 2021-03-31 A0019 990QT MARK 99.10 # no update, condition not met
7 2021-03-31 C0019 990QT OFFSET 1
8 2021-03-31 C0019 990QT GHTC 21 # updated with sum of self, 7, and 9
9 2021-03-31 C0019 990QT OFFSET 15
For step 2, I can do:
filtered_df = df[df.TYPE != 'OFFSET']
But I couldn't figure out how I would aggregate the values. This post discusses a similar problem, but I couldn't adapt it to my requirement.
Upvotes: 1
Views: 637
Reputation: 24314
For your step 1:
First, change the dtype of the 'I_VAL' column from string to float using the astype() method:
df['I_VAL']=df['I_VAL'].astype(float)
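As an aside (not part of the original answer): if I_VAL could ever contain strings that are not valid numbers, astype(float) would raise. In that case pd.to_numeric with errors='coerce' is a common alternative, turning unparseable values into NaN instead; a small sketch with hypothetical sample values:

```python
import pandas as pd

# 'n/a' is a hypothetical bad value, used only to show the coercion behavior
s = pd.Series(['0.10', '100', 'n/a'])

# errors='coerce' converts unparseable strings to NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')
```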
Then you can use the groupby() method with transform('sum') and boolean masking:
mask = df['TYPE'] != 'OFFSET'
df.loc[mask, 'I_VAL'] = df.groupby(['DATE', 'A_ID', 'R_ID'])['I_VAL'].transform('sum')[mask]
Now if you print df
you will get your desired output:
#output
DATE A_ID R_ID TYPE I_VAL
0 2021-03-31 A0019 990RT OFFSET 0.1
1 2021-03-31 A1019 990CT MARK 0.1
2 2021-03-31 A0019 990RT MARK 100.8
3 2021-03-31 A0019 990RT OFFSET 0.7
4 2021-03-31 A0029 990CT OFFSET 1.1
5 2021-03-31 A0029 990CT MARK 10.2
6 2021-03-31 A0019 990QT MARK 99.1
7 2021-03-31 C0019 990QT OFFSET 1.0
8 2021-03-31 C0019 990QT GHTC 21.0
9 2021-03-31 C0019 990QT OFFSET 15.0
For your step 2:
Use boolean masking:
filtered_df = df[df.TYPE != 'OFFSET']
Now if you print filtered_df
you will get your desired output:
#output
DATE A_ID R_ID TYPE I_VAL
1 2021-03-31 A1019 990CT MARK 0.1
2 2021-03-31 A0019 990RT MARK 100.8
5 2021-03-31 A0029 990CT MARK 10.2
6 2021-03-31 A0019 990QT MARK 99.1
8 2021-03-31 C0019 990QT GHTC 21.0
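Putting both steps together, here is a self-contained sketch that rebuilds the sample frame from the question and can be run end to end:

```python
import pandas as pd

df = pd.DataFrame(
    [['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.10'],
     ['2021-03-31', 'A1019', '990CT', 'MARK', '0.10'],
     ['2021-03-31', 'A0019', '990RT', 'MARK', '100'],
     ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.70'],
     ['2021-03-31', 'A0029', '990CT', 'OFFSET', '1.10'],
     ['2021-03-31', 'A0029', '990CT', 'MARK', '9.10'],
     ['2021-03-31', 'A0019', '990QT', 'MARK', '99.10'],
     ['2021-03-31', 'C0019', '990QT', 'OFFSET', '1'],
     ['2021-03-31', 'C0019', '990QT', 'GHTC', '5'],
     ['2021-03-31', 'C0019', '990QT', 'OFFSET', '15']],
    columns=['DATE', 'A_ID', 'R_ID', 'TYPE', 'I_VAL'])

# I_VAL arrives as strings, so convert to float first
df['I_VAL'] = df['I_VAL'].astype(float)

# Step 1: write the per-(DATE, A_ID, R_ID) group sum into the NON OFFSET rows.
# transform('sum') returns a Series aligned to the original index, so the
# mask selects which rows actually receive the group total.
mask = df['TYPE'] != 'OFFSET'
df.loc[mask, 'I_VAL'] = (
    df.groupby(['DATE', 'A_ID', 'R_ID'])['I_VAL'].transform('sum')[mask]
)

# Step 2: drop the OFFSET rows
filtered_df = df[mask]
print(filtered_df)
```

Note that a NON OFFSET row with no matching OFFSET rows (such as row 6) simply receives the sum of its own group, i.e. its own value, which matches the "no update" rows in the desired output.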
Upvotes: 1