Reputation: 61
There seem to be many online examples of the DataFrame groupby() method, which describe grouping by columns with data aggregated across multiple rows (Series), i.e. going "top-to-bottom".
Given 2 dataframes df_1 and df_2:
df_1:
Instru_1 Instru_2 Instru_3 Instru_5 Instru_6 Instru_7
2020-10-01 10 10 20 20 10 30
where row values are classification IDs and
df_2:
Instru_1 Instru_2 Instru_3 Instru_5 Instru_6 Instru_7
2020-10-01 0.1 0.2 0.2 0.2 0.2 0.1
where row values are weights summing to 1.0
Is groupby() still the way forward if I need to group across the row values of df_1, where the number of instruments can be indeterminate, to get a result df_result:
df_result:
10 20 30
2020-10-01 0.5 0.4 0.1
where:
The columns are the classification IDs from the df_1 record
The values are the sums of the df_2 weights for each classification ID
(e.g. for ID=10: 0.1 + 0.2 + 0.2 = 0.5; for ID=20: 0.2 + 0.2 = 0.4; for ID=30: 0.1)
Is the quickest approach still to perform multiple steps (merge df_1 and df_2 and process per row)?
Step 1: Enumerate the classification IDs in row 1 and create df_result
Step 2: Enumerate row 2 and perform the summation per classification (this looks tricky!)
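For concreteness, here is a rough sketch of that manual two-step approach (the frames are rebuilt inline so it runs standalone; the loop is illustrative only, not a recommendation):
import pandas as pd

df_1 = pd.DataFrame(
    [[10, 10, 20, 20, 10, 30]],
    index=["2020-10-01"],
    columns=["Instru_1", "Instru_2", "Instru_3", "Instru_5", "Instru_6", "Instru_7"],
)
df_2 = pd.DataFrame(
    [[0.1, 0.2, 0.2, 0.2, 0.2, 0.1]], index=df_1.index, columns=df_1.columns
)

rows = {}
for date in df_1.index:
    sums = {}
    # enumerate the classification IDs in this row and add up the matching weights
    for instrument, classif_id in df_1.loc[date].items():
        sums[classif_id] = sums.get(classif_id, 0.0) + df_2.loc[date, instrument]
    rows[date] = sums
df_result = pd.DataFrame.from_dict(rows, orient="index").sort_index(axis=1)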
Any advice on the best approach would be appreciated (or pointers to grouping across row values). Thanks in advance.
Upvotes: 2
Views: 129
Reputation: 71707
Let us try:
s1, s2 = df1.stack(), df2.stack()
out = s2.groupby([s2.droplevel(1).index, s2.index.map(s1)]).sum().unstack()
Details:
stack the dataframes df1 and df2, thereby creating the MultiIndex series s1 and s2:
>>> s1
2020-10-01 Instru_1 10
Instru_2 10
Instru_3 20
Instru_5 20
Instru_6 10
Instru_7 30
dtype: int64
>>> s2
2020-10-01 Instru_1 0.1
Instru_2 0.2
Instru_3 0.2
Instru_5 0.2
Instru_6 0.2
Instru_7 0.1
dtype: float64
map the index of s2 with the series s1 to get the columns for the new dataframe, i.e. 10, 20, 30...:
>>> s2.index.map(s1)
Int64Index([10, 10, 20, 20, 10, 30], dtype='int64')
Finally, group the series s2 on level=0 along with the above mapped columns and aggregate using sum, followed by unstack to reshape:
>>> out
10 20 30
2020-10-01 0.5 0.4 0.1
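If there are many dates, the same two lines work row-wise unchanged; wrapping them in a small helper (a sketch, and the function name sum_weights_by_classification is mine, not from the answer) makes that explicit:
def sum_weights_by_classification(class_df, weight_df):
    # stack both frames into long series indexed by (date, instrument)
    s1, s2 = class_df.stack(), weight_df.stack()
    # group the weights by date and by the classification ID mapped from s1
    return s2.groupby([s2.droplevel(1).index, s2.index.map(s1)]).sum().unstack()

out = sum_weights_by_classification(df1, df2)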
Upvotes: 1
Reputation: 75120
You could try concat with reshaping and groupby:
u = pd.concat((df1,df2),keys=['cols','rows'])
out = (u.unstack().T.reset_index(-1)
.groupby(['level_1','cols'])['rows'].sum().unstack(fill_value=0))
print(out)
10 20 30
2020-10-01 0.5 0.4 0.1
(Sample run for multiple columns omitted.)
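For readers stepping through the chain, here is the same pipeline split into named stages (a sketch; the intermediate variable names are mine):
u = pd.concat((df1, df2), keys=['cols', 'rows'])  # classification IDs stacked above weights
long = u.unstack().T.reset_index(-1)              # one row per instrument: 'level_1' (date), 'cols' (ID), 'rows' (weight)
out = (long.groupby(['level_1', 'cols'])['rows']  # sum the weights per (date, ID)
           .sum().unstack(fill_value=0))          # and move the IDs back out to columns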
Upvotes: 2
Reputation: 31226
- unstack() to change to row based
- join() to bring the two unstacked DFs together as rows
- groupby() date and classification, then sum()
- unstack() to change back to column based as required
import io
import pandas as pd

df_1 = pd.read_csv(io.StringIO(""" Instru_1 Instru_2 Instru_3 Instru_5 Instru_6 Instru_7
2020-10-01 10 10 20 20 10 30"""), sep=r"\s+")
df_2 = pd.read_csv(io.StringIO(""" Instru_1 Instru_2 Instru_3 Instru_5 Instru_6 Instru_7
2020-10-01 0.1 0.2 0.2 0.2 0.2 0.1"""), sep=r"\s+")
df_1.columns.set_names("instrument", inplace=True)
df_1.index.set_names("date", inplace=True)
df_2.columns.set_names("instrument", inplace=True)
df_2.index.set_names("date", inplace=True)
(df_1.unstack().to_frame().rename(columns={0:"classification"})
.join(df_2.unstack().to_frame().rename(columns={0:"weight"}))
.groupby(["date","classification"]).sum()
.unstack(1).droplevel(0, axis=1)
)
classification   10   20   30
date
2020-10-01      0.5  0.4  0.1
Upvotes: 0
Reputation: 42402
A bit ugly, but here's a way to do it by unstacking the dataframes and joining, followed by a groupby, sum and pivot:
# long format: one row per (instrument, date), with classification ('0l') and weight ('0r')
df3 = df1.unstack().to_frame().join(df2.unstack().to_frame(), lsuffix='l', rsuffix='r')
# sum the weights per (date, classification), then pivot the IDs back out to columns
df4 = (df3.reset_index().groupby(['level_1', '0l'])['0r'].sum()
       .reset_index().pivot_table('0r', 'level_1', '0l'))
df4.index.name = None
df4.columns.name = None
print(df4)
10 20 30
2020-10-01 0.5 0.4 0.1
Upvotes: 1