JacobWuzHere
JacobWuzHere

Reputation: 913

Indexing/selecting by year in Pandas when MultiIndex is in use

Longtime lurker trying to more actively engage in the community here.

Here goes:

When I set the index of the dataframe as just a single-level DatetimeIndex, everything works as expected. When I Index by the below, I get the following returns:

Input:

sample_Email_click2 = sample_Email_click.reset_index(drop=True)
sample_Email_click2.set_index('Date_Comb', inplace=True, drop=False)
sample_Email_click2

Output:

               Date_Comb              Campaign_CD

Date_Comb 2012-10-11 05:18:05 2012-10-11 05:18:05 2842

             2012-09-27 15:04:12    2012-09-27 15:04:12 2805

Yields a dataframe in the format that I am expecting...

sample_Email_click2.ix["2012"].head(2).ix[:,:2].index

<class 'pandas.tseries.index.DatetimeIndex'>
enter code here`[2012-10-11 05:18:05, 2012-09-27 15:04:12]
Length: 2, Freq: None, Timezone: None`

However, when I try to introduce a MultiIndex and use the .ix and .xs methods, things seem to go awry.

print sample_Email_click2.index.levels[0]
print sample_Email_click2.index.levels[1]

Output:

<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-03 01:25:55, ..., 2014-07-31 23:49:53]
Length: 562933, Freq: None, Timezone: None

Index([u'10003341', u'10006447', u'10009143', u'10009338', u'10010524', u'10018143', u'10020892', u'10038064', u'10048634', u'10055227', u'10056611', u'10057317', u'10063204', u'10070907', u'10072846', u'10078439', u'10078571', u'10079289', u'10086611', u'10097987', u'10101451', u'10102952', u'10110609', u'10118756', u'10127396', u'10130253', u'10144913', u'10149927', u'10157760', u'10172201', u'10173908', u'10179552', u'10184175', u'10189897', u'10190549', u'10191493', u'10197319', u'10210260', u'10217133', u'10217986', u'10218016', u'10232329', u'10240641', u'10244519', u'10244922', u'10248604', u'10248909', u'10257722', u'10271347', u'10282045', u'10284030', u'10289296', u'10295972', u'10305340', u'10326711', u'10332290', u'10343444', u'10343976', u'10357733', u'10360593', u'10366234', u'10376405', u'10376704', u'10379122', u'10386833', u'10391989', u'10398309', u'10405057', u'10406239', u'10411259', u'10414566', u'10418969', u'10419291', u'10424138', u'10435470', u'10440233', u'10448304', u'10466519', u'10470115', u'10476204', u'10480159', u'10481054', u'10484122', u'10487410', u'10487540', u'10490374', u'10495338', u'105004023', u'105004035', u'105004071', u'105004316', u'105013017', u'105014110', u'105014144', u'105016206', u'105017011', u'105017739', u'105018514', u'105018717', u'105018810', ...], dtype='object')

For reference, my index names are:

sample_Email_click2.index.names
FrozenList([u'Date_Comb', u'MATCHED_CUSTOMER_ID'])

So I was hoping that any of the following should/could work:

sample_Email_click2.ix["2012"]
sample_Email_click2.ix["2012", :]
sample_Email_click2.loc["2012"]
sample_Email_click2.xs("2012", level="Date_Comb")
sample_Email_click2.xs("2012", level=0)

And to make sure that I wasn't totally crazy, this does work:

sample_Email_click2.xs("10240641", level=1).ix["2012"]

                       Date_Comb              Campaign_CD
Date_Comb       
2012-08-09 17:01:35 2012-08-09 17:01:35       2648
2012-12-12 17:44:54 2012-12-12 17:44:54       3056

Can someone please help me figure out how to use a DatetimeIndex inside of a MultiIndex without having to query on the precise timestamp (since I suspect that pd.to_datetime(2012, 08, 09, etc.) would work....

Thank you so much in advance.

Upvotes: 1

Views: 824

Answers (1)

unutbu
unutbu

Reputation: 879411

I don't think this form of "fuzzy indexing" is possible (yet?) when using a MultIndex. However, as a workaround, you could extract the single-level index from the multiindex, and use its get_loc method to find the integer indices corresponding to the desired dates in 2012:

For example,

index = pd.date_range('2012-3-3', '2014-07-31', freq='M')
N = len(index)
df = pd.DataFrame({'id': range(N), 'CD': np.random.randint(10, N)}, index=index)
df = df.set_index(['id'], append=True)
dtindex = df.index.get_level_values(0)
result = df.iloc[dtindex.get_loc('2012')]
print(result)

yields

               CD
           id    
2012-03-31 0   22
2012-04-30 1   22
2012-05-31 2   22
2012-06-30 3   22
2012-07-31 4   22
2012-08-31 5   22
2012-09-30 6   22
2012-10-31 7   22
2012-11-30 8   22
2012-12-31 9   22

PS: pd.IndexSlice effectively replaces xs. Although they are not the solution here, you may find multiIndexing using slicers useful to you in the future.

Upvotes: 1

Related Questions