ely
ely

Reputation: 77464

Python Pandas How to assign groupby operation results back to columns in parent dataframe?

I have the following data frame in IPython, where each row is a single stock:

In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker      21206  non-null values
Company              21210  non-null values
Country              21210  non-null values
MarketCap            21210  non-null values
PriceReturn          21210  non-null values
SEDOL                21210  non-null values
yearmonth            21210  non-null values
dtypes: float64(2), int64(1), object(4)

I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the "yearmonth" column.

This works as expected:

In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204      -0.109444
201205      -0.290546

But then I want to sort of "broadcast" these values back to the indices in the original data frame, and save them as constant columns where the dates match.

In [263]: dateGrps = bdata.groupby("yearmonth")

In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

TypeError: 'DataFrameGroupBy' object does not support item assignment

I realize this naive assignment should not work. But what is the "right" Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?

In the end, I want a column called "MarketReturn" than will be a repeated constant value for all indices that have matching date with the output of the groupby operation.

One hack to achieve this would be the following:

marketRetsByDate  = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))

for elem in marketRetsByDate.index.values:
    bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]

But this is slow, bad, and unPythonic.

Upvotes: 128

Views: 126835

Answers (6)

Deadly Pointer
Deadly Pointer

Reputation: 312

If we want to compute functions on each group and assign back to columns, it's possible with apply:

df = pd.DataFrame({
    'x': list(range(6)),
    'group': [0, 1, 0, 1, 0, 1],
})
print(df)

def process_group(df):
    # It is not recommended to mutate the object we're iterating on, thus the copy:
    # https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods
    df = df.copy() 
    df['y'] = df['x'] / df['x'].sum()
    df['x'] = df['x'] * 2
    return df


print(df.groupby(['group']).apply(process_group).reset_index(drop=True))

df:

   x  group
0  0      0
1  1      1
2  2      0
3  3      1
4  4      0
5  5      1

result:

    x  group         y
0   0      0  0.000000
1   4      0  0.333333
2   8      0  0.666667
3   2      1  0.111111
4   6      1  0.333333
5  10      1  0.555556

Upvotes: 0

Han Zhang
Han Zhang

Reputation: 432

I did not find a way to make assignment to the original dataframe. So I just store the results from the groups and concatenate them. Then we sort the concatenated dataframe by index to get the original order as the input dataframe. Here is a sample code:

In [10]: df = pd.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [11]: df.head()
Out[11]:
   month         A         B
0      4 -0.029106 -0.904648
1      2 -2.724073  0.492751
2      7  0.732403  0.689530
3      2  0.487685 -1.017337
4      1  1.160858 -0.025232

In [12]: res = []

In [13]: for month, group in df.groupby('month'):
    ...:     new_df = pd.DataFrame({
    ...:         'A^2+B': group.A ** 2 + group.B,
    ...:         'A+B^2': group.A + group.B**2
    ...:     })
    ...:     res.append(new_df)
    ...:

In [14]: res = pd.concat(res).sort_index()

In [15]: res.head()
Out[15]:
      A^2+B     A+B^2
0 -0.903801  0.789282
1  7.913327 -2.481270
2  1.225944  1.207855
3 -0.779501  1.522660
4  1.322360  1.161495

This method is pretty fast and extensible. You can derive any feature here.

Note: If the dataframe is too large, concat may cause you OOM error.

Upvotes: 2

seeiespi
seeiespi

Reputation: 3848

As a general rule when using groupby(), if you use the .transform() function pandas will return a table with the same length as your original. When you use other functions like .sum() or .first() then pandas will return a table where each row is a group.

I'm not sure how this works with apply but implementing elaborate lambda functions with transform can be fairly tricky so the strategy that I find most helpful is to create the variables I need, place them in the original dataset and then do my operations there.

If I understand what you're trying to do correctly first you can calculate the total market cap for each group:

bdata['group_MarketCap'] = bdata.groupby('yearmonth')['MarketCap'].transform('sum')

This will add a column called "group_MarketCap" to your original data which would contain the sum of market caps for each group. Then you can calculate the weighted values directly:

bdata['weighted_P'] = bdata['PriceReturn'] * (bdata['MarketCap']/bdata['group_MarketCap'])

And finally you would calculate the weighted average for each group using the same transform function:

bdata['MarketReturn'] = bdata.groupby('yearmonth')['weighted_P'].transform('sum')

I tend to build my variables this way. Sometimes you can pull off putting it all in a single command but that doesn't always work with groupby() because most of the time pandas needs to instantiate the new object to operate on it at the full dataset scale (i.e. you can't add two columns together if one doesn't exist yet).

Hope this helps :)

Upvotes: 66

Wes McKinney
Wes McKinney

Reputation: 105591

May I suggest the transform method (instead of aggregate)? If you use it in your original example it should do what you want (the broadcasting).

Upvotes: 33

Wouter Overmeire
Wouter Overmeire

Reputation: 69216

In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
           A         B  month       A_r
0  -0.040710  0.182269      0 -0.331816
1  -0.004867  0.642243      1  2.448232
2  -0.162191  0.442338      4  2.045909
3  -0.979875  1.367018      5 -2.736399
4  -1.126198  0.338946      5 -2.736399
5  -0.992209 -1.343258      1  2.448232
6  -1.450310  0.021290      0 -0.331816
7  -0.675345 -1.359915      9  2.722156

Upvotes: 99

Garrett
Garrett

Reputation: 49866

While I'm still exploring all of the incredibly smart ways that apply concatenates the pieces it's given, here's another way to add a new column in the parent after a groupby operation.

In [236]: df
Out[236]: 
  yearmonth    return
0    201202  0.922132
1    201202  0.220270
2    201202  0.228856
3    201203  0.277170
4    201203  0.747347

In [237]: def add_mkt_return(grp):
   .....:     grp['mkt_return'] = grp['return'].sum()
   .....:     return grp
   .....: 

In [238]: df.groupby('yearmonth').apply(add_mkt_return)
Out[238]: 
  yearmonth    return  mkt_return
0    201202  0.922132    1.371258
1    201202  0.220270    1.371258
2    201202  0.228856    1.371258
3    201203  0.277170    1.024516
4    201203  0.747347    1.024516

Upvotes: 72

Related Questions