Reputation: 16478
Say I have MultiIndex series as
date foo
2006-01-01 1 12931926.310
3 11084049.460
5 10812205.359
7 9031510.239
9 5324054.903
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
If it weren't a MultiIndex, I could select those with year 2007
through df.loc['2007']
. How do I do this here? My natural guess was df.loc['2007', :]
, but that gave my an empty Series([], name: FINLWT21, dtype: float64)
.
Ultimatively, I am also interested in replacing all rows for different dates than 2007
with the rows of the year 2007
.
That is, my expected output is
date foo
2006-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
I tried to implement @unutbu's solution, but
mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007]
will naturally set the values (due to non-existence on the RHS) to NaN
. Usually, these issues are fixed by
mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007].values
, but given that I have 10
values (and more in my real dataset) on the lefthand side, but only 5
on the right, I get
ValueError: cannot set using a list-like indexer with a different length than the value
The only alternative that comes to my mind now is to iterate over the first index and then use the previous command for each subgroup, but that doesn't appear to be the most efficient solution.
Upvotes: 3
Views: 4923
Reputation: 879143
Given the Series
In [207]: series
Out[212]:
date foo
2006-01-01 1 12931926.310
3 11084049.460
5 10812205.359
7 9031510.239
9 5324054.903
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
Name: val, dtype: float64
you could extract the date
index with
dateindex = series.index.get_level_values('date')
# Ensure the dateindex is a DatetimeIndex (as opposed to a plain Index)
dateindex = pd.DatetimeIndex(dateindex)
Now selection of the rows where the year equals 2007 can be done with a boolean condition:
# select rows where year equals 2007
series2007 = series.loc[dateindex.year == 2007]
If the foo
values cycle through the same values in the same order for each date,
then you could replace all the values in the series with those from 2007 with
N = len(series)/len(series2007)
series[:] = np.tile(series.loc[dateindex.year == 2007].values, N)
An advantage of using np.tile
and .values
is that it will generate the desired array of values relatively quickly. A (possible) disadvantage is that this is ignoring the index so it relies on the assumption that the foo
values cycle through the same values in the same order for each date.
The more robust (but slower) way is to use a join:
df = series.reset_index('date')
df2007 = df.loc[dateindex.year==2007]
df = df.join(df2007, rsuffix='_2007')
df = df[['date', 'val_2007']]
df = df.set_index(['date'], append=True)
df = df.swaplevel(0,1).sort_index()
yields
In [304]: df.swaplevel(0,1).sort_index()
Out[304]:
val_2007
date foo
2006-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2007-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
2008-01-01 1 11086082.624
3 12028419.560
5 11957253.031
7 10643307.061
9 6034854.915
Upvotes: 7
Reputation: 109528
To select the values from the MultiIndex for the desired year (e.g. 2007), you can use:
target_year = 2007
df[[ts.year == target_year for ts in df.index.get_level_values(0)]]
If the date index is not in the form of a timestamp, it needs to be converted:
df[[pd.Timestamp(ts).year == target_year for ts in df.index.get_level_values(0)]]
Upvotes: 0