Bob Sleigh

Reputation: 61

Pandas dataframe groupby and summation, within group, across row values rather than by columns

Most online examples of the DataFrame groupby() method describe grouping by columns, with data aggregated across multiple rows (Series), i.e. going "top-to-bottom".

Given 2 dataframes df_1 and df_2:

df_1:
                  Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01        10        10        20        20        10        30

where row values are classification IDs and

df_2:
                   Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01         0.1       0.2       0.2       0.2       0.2       0.1

where row values are weights summing to 1.0

Is groupby() still the way forward if I need to group across the row values of df_1, where the number of instruments can be indeterminate, to get a result df_result:

df_result:

                  10         20        30
2020-10-01        0.5        0.4       0.1

where: the columns are the classification IDs from the df_1 record
       the values are the sum of the df_2 weights for each classification ID

(e.g. Classif ID=10: element value = 0.1 + 0.2 + 0.2 = 0.5; Classif ID=20: element = 0.2 + 0.2 = 0.4; ID=30: element = 0.1)

Is the quickest approach still to perform multiple steps (merge df_1 and df_2 and process per row)?
Step 1: Enumerate the row 1 classification IDs and create df_result
Step 2: Enumerate row 2 and perform the summation per classification (this looks tricky!)
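
For what it's worth, for the single-date case I can already get the desired result in one step by transposing and grouping (a rough sketch of my own attempt; it only handles one row of classifications at a time):

import pandas as pd

df_1 = pd.DataFrame([[10, 10, 20, 20, 10, 30]],
                    index=['2020-10-01'],
                    columns=['Instru_1', 'Instru_2', 'Instru_3',
                             'Instru_5', 'Instru_6', 'Instru_7'])
df_2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1]],
                    index=['2020-10-01'],
                    columns=df_1.columns)

key = df_1.iloc[0]   # Series: instrument -> classification ID
key.name = None      # stop the date label leaking into the result's axis name
df_result = df_2.T.groupby(key).sum().T   # group columns by ID, sum the weights
print(df_result)
#              10   20   30
# 2020-10-01  0.5  0.4  0.1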

Any advice on the best approach would be appreciated (or pointers to grouping across row values). Thanks in advance.

Upvotes: 2

Views: 129

Answers (4)

Shubham Sharma

Reputation: 71707

Let us try:

s1, s2 = df1.stack(), df2.stack()
out = s2.groupby([s2.droplevel(1).index, s2.index.map(s1)]).sum().unstack()

Details:

stack the dataframes df1 and df2, creating the MultiIndex series s1 and s2:

>>> s1
2020-10-01  Instru_1    10
            Instru_2    10
            Instru_3    20
            Instru_5    20
            Instru_6    10
            Instru_7    30
dtype: int64

>>> s2
2020-10-01  Instru_1    0.1
            Instru_2    0.2
            Instru_3    0.2
            Instru_5    0.2
            Instru_6    0.2
            Instru_7    0.1
dtype: float64

map the index of s2 with the series s1 to get the columns for the new dataframe, i.e. 10, 20, 30...:

>>> s2.index.map(s1)
Int64Index([10, 10, 20, 20, 10, 30], dtype='int64')

Finally, group the series s2 on level=0 (the dates) together with the mapped classification IDs above, aggregate using sum, and unstack to reshape:

>>> out
             10   20   30
2020-10-01  0.5  0.4  0.1
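
To sanity-check that the same two lines generalize beyond a single date, here is a self-contained run (the 2020-10-02 row and its values are invented for illustration):

import pandas as pd

cols = ['Instru_1', 'Instru_2', 'Instru_3', 'Instru_5', 'Instru_6', 'Instru_7']
df1 = pd.DataFrame([[10, 10, 20, 20, 10, 30],
                    [10, 20, 20, 30, 30, 30]],        # second row is made up
                   index=['2020-10-01', '2020-10-02'], columns=cols)
df2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1],
                    [0.3, 0.1, 0.1, 0.2, 0.2, 0.1]],  # second row is made up
                   index=['2020-10-01', '2020-10-02'], columns=cols)

s1, s2 = df1.stack(), df2.stack()
out = s2.groupby([s2.droplevel(1).index, s2.index.map(s1)]).sum().unstack()
print(out)
#              10   20   30
# 2020-10-01  0.5  0.4  0.1
# 2020-10-02  0.3  0.2  0.5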

Upvotes: 1

anky

Reputation: 75120

You could try concat with reshaping and groupby:

u = pd.concat((df1,df2),keys=['cols','rows'])
out = (u.unstack().T.reset_index(-1)
       .groupby(['level_1','cols'])['rows'].sum().unstack(fill_value=0))

print(out)

             10   20   30
2020-10-01  0.5  0.4  0.1

Sample Run for multiple cols:

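The screenshot didn't survive here; as a stand-in, here is a self-contained multi-date run of the same code (the 2020-10-02 row of data is my own invention):

import pandas as pd

cols = ['Instru_1', 'Instru_2', 'Instru_3', 'Instru_5', 'Instru_6', 'Instru_7']
df1 = pd.DataFrame([[10, 10, 20, 20, 10, 30],
                    [10, 20, 20, 30, 30, 30]],
                   index=['2020-10-01', '2020-10-02'], columns=cols)
df2 = pd.DataFrame([[0.1, 0.2, 0.2, 0.2, 0.2, 0.1],
                    [0.3, 0.1, 0.1, 0.2, 0.2, 0.1]],
                   index=['2020-10-01', '2020-10-02'], columns=cols)

u = pd.concat((df1, df2), keys=['cols', 'rows'])
out = (u.unstack().T.reset_index(-1)
       .groupby(['level_1', 'cols'])['rows'].sum().unstack(fill_value=0))
print(out)   # one row per date, one column per classification ID
# note: concat upcasts df1's integer IDs to float, so the labels may print as 10.0 etc.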

Upvotes: 2

Rob Raymond

Reputation: 31226

  • prepare data frames by naming row and column indexes
  • unstack() to change to row based
  • join() to bring the two unstacked DFs together as rows
  • now it's simple groupby()
  • unstack() to change back to column based as required

import io
import pandas as pd

df_1 = pd.read_csv(io.StringIO("""                  Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01        10        10        20        20        10        30"""), sep=r"\s+")

df_2 = pd.read_csv(io.StringIO("""                   Instru_1  Instru_2  Instru_3  Instru_5  Instru_6  Instru_7
2020-10-01         0.1       0.2       0.2       0.2       0.2       0.1"""), sep=r"\s+")

df_1.columns.set_names("instrument", inplace=True)
df_1.index.set_names("date", inplace=True)
df_2.columns.set_names("instrument", inplace=True)
df_2.index.set_names("date", inplace=True)


(df_1.unstack().to_frame().rename(columns={0:"classification"})
 .join(df_2.unstack().to_frame().rename(columns={0:"weight"}))
 .groupby(["date","classification"]).sum()
 .unstack(1).droplevel(0, axis=1)
)

classification   10   20   30
date
2020-10-01      0.5  0.4  0.1

Upvotes: 0

mck

Reputation: 42402

A bit ugly, but here's a way to do it by unstacking the dataframes and joining, followed by a groupby, sum, and pivot back to wide form:

df3 = df1.unstack().to_frame().join(df2.unstack().to_frame(), lsuffix='l', rsuffix='r')

df4 = (df3.reset_index()
       .groupby(['level_1', '0l'])['0r'].sum()   # select the weight column, then sum
       .reset_index()
       .pivot_table('0r', 'level_1', '0l'))

df4.index.name = None
df4.columns.name = None

print(df4)
             10   20   30
2020-10-01  0.5  0.4  0.1
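
The same idea reads a little better if the stacked series are named up front instead of relying on the '0l'/'0r' suffixes (a sketch of my own variant, not part of the original answer):

# Name the stacked columns explicitly; assumes the same df1/df2 as the question.
df3 = (df1.unstack().rename('classification').to_frame()
       .join(df2.unstack().rename('weight')))
df4 = (df3.groupby([df3.index.get_level_values(-1), 'classification'])['weight']
       .sum()
       .unstack())   # dates as rows, classification IDs as columns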

Upvotes: 1
