Antonio L.
Antonio L.

Reputation: 59

Subtract pandas series with different indices

I have a dataframe where I have a collection of the following entries: (Date, Volume).

I would like to create a new dataframe column where the Volume column is subtracted with the mean of the monthly volume. I would like what is the way to achieve that in pandas.

In below you can find the setup of the above:

import pandas as pd
import io
import numpy as np

df = pd.read_csv(io.StringIO(csv_file_content))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Get the means for each year/month pair
month_means = df.groupby([df.index.year, df.index.month])['Volume'].mean().round(2)

# I would like to subtract the Volume with the mean of its month.
df['spread_monthly'] = df['Volume'] - month_means[zip(df.index.year, df.index.month)]

Upvotes: 1

Views: 1883

Answers (3)

Anurag Dabas
Anurag Dabas

Reputation: 24322

You can also do this by groupby() and transform() method:

df['spread_monthly']=df['Volume']-df.groupby([df.index.month,df.index.year])['Volume'].transform('mean').values

Upvotes: 0

Utsav
Utsav

Reputation: 5918

Just to add another option with vectorized solution.

We can use groupby with pd.Grouper with freq=M to get the volume spread per month. Setting the Date column to index is a choice.

Toy Example

df = pd.DataFrame({
    'Date': pd.date_range('2020.10.10', periods=12, freq='15D'),
    'Volume': np.arange(1,13)
})
df

    Date    Volume
0   2020-10-10  1
1   2020-10-25  2
2   2020-11-09  3
3   2020-11-24  4
4   2020-12-09  5
5   2020-12-24  6
6   2021-01-08  7
7   2021-01-23  8
8   2021-02-07  9
9   2021-02-22  10
10  2021-03-09  11
11  2021-03-24  12

Code

df['spread_monthly'] = df.groupby([pd.Grouper(key='Date', freq='M')]).transform('mean')
df['spread_monthly'] = df.spread_monthly - df.Volume
df

Output

    Date    Volume  spread_monthly
0   2020-10-10  1   0.5
1   2020-10-25  2   -0.5
2   2020-11-09  3   0.5
3   2020-11-24  4   -0.5
4   2020-12-09  5   0.5
5   2020-12-24  6   -0.5
6   2021-01-08  7   0.5
7   2021-01-23  8   -0.5
8   2021-02-07  9   0.5
9   2021-02-22  10  -0.5
10  2021-03-09  11  0.5
11  2021-03-24  12  -0.5

Upvotes: 0

Carlos E Jimenez
Carlos E Jimenez

Reputation: 91

It seems to be complaining about the indexing with the grouped month_means and the original (datetime) index of df['Volume']. To avoid problems with the indexing, you can remove the different indices using x.values for each series.

Do df['spread_monthly'] = df['Volume'].values - month_means[zip(df.index.year, df.index.month)].values instead.

Upvotes: 2

Related Questions