FooBar
FooBar

Reputation: 16478

Pandas: Select from date in MultiIndex

Say I have MultiIndex series as

date        foo
2006-01-01  1         12931926.310
            3         11084049.460
            5         10812205.359
            7          9031510.239
            9          5324054.903
2007-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915

If it weren't a MultiIndex, I could select those with year 2007 through df.loc['2007']. How do I do this here? My natural guess was df.loc['2007', :], but that gave my an empty Series([], name: FINLWT21, dtype: float64).

Ultimate Goal

Ultimatively, I am also interested in replacing all rows for different dates than 2007 with the rows of the year 2007.

That is, my expected output is

date        foo
2006-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915
2007-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915

I tried to implement @unutbu's solution, but

mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007]

will naturally set the values (due to non-existence on the RHS) to NaN. Usually, these issues are fixed by

mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007].values

, but given that I have 10 values (and more in my real dataset) on the lefthand side, but only 5 on the right, I get

ValueError: cannot set using a list-like indexer with a different length than the value

The only alternative that comes to my mind now is to iterate over the first index and then use the previous command for each subgroup, but that doesn't appear to be the most efficient solution.

Upvotes: 3

Views: 4923

Answers (2)

unutbu
unutbu

Reputation: 879143

Given the Series

In [207]: series
Out[212]: 
date        foo
2006-01-01  1      12931926.310
            3      11084049.460
            5      10812205.359
            7       9031510.239
            9       5324054.903
2007-01-01  1      11086082.624
            3      12028419.560
            5      11957253.031
            7      10643307.061
            9       6034854.915
Name: val, dtype: float64

you could extract the date index with

dateindex = series.index.get_level_values('date')
# Ensure the dateindex is a DatetimeIndex (as opposed to a plain Index)
dateindex = pd.DatetimeIndex(dateindex)

Now selection of the rows where the year equals 2007 can be done with a boolean condition:

# select rows where year equals 2007
series2007 = series.loc[dateindex.year == 2007]

If the foo values cycle through the same values in the same order for each date, then you could replace all the values in the series with those from 2007 with

N = len(series)/len(series2007)
series[:] = np.tile(series.loc[dateindex.year == 2007].values, N)

An advantage of using np.tile and .values is that it will generate the desired array of values relatively quickly. A (possible) disadvantage is that this is ignoring the index so it relies on the assumption that the foo values cycle through the same values in the same order for each date.

The more robust (but slower) way is to use a join:

df = series.reset_index('date')
df2007 = df.loc[dateindex.year==2007]
df = df.join(df2007, rsuffix='_2007')
df = df[['date', 'val_2007']]
df = df.set_index(['date'], append=True)
df = df.swaplevel(0,1).sort_index()     

yields

In [304]: df.swaplevel(0,1).sort_index()
Out[304]: 
                    val_2007
date       foo              
2006-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915
2007-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915
2008-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915

Upvotes: 7

Alexander
Alexander

Reputation: 109528

To select the values from the MultiIndex for the desired year (e.g. 2007), you can use:

target_year = 2007
df[[ts.year == target_year for ts in df.index.get_level_values(0)]]

If the date index is not in the form of a timestamp, it needs to be converted:

df[[pd.Timestamp(ts).year == target_year for ts in df.index.get_level_values(0)]]

Upvotes: 0

Related Questions