B. Shieh

Reputation: 321

Pandas Resampler Key Error

My first ever question on StackOverflow. Until now I've always been able to find answers to my questions with a search. Hopefully not embarrassing myself by asking a duplicate question.

I am resampling a pandas dataframe. I then want to loop through the dataframes in the resampler object to extract some information.

However, when I use the keys returned by resampler.groups.keys(), I get a KeyError for any week that has no data. This seems inconsistent to me. I would have expected either to get an empty dataframe for that week, or for keys() not to return a key for that week's group at all.

import pandas as pd

df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)

by_week = df.resample('W-SUN')
by_week.groups

Gives:

{Timestamp('2017-02-26 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-05 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-12 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-19 00:00:00', offset='W-SUN'): 8}

Then sum just to show there is no data in the middle two weeks:

print by_week.sum()

                   ID    DATA
DATETIME                     
2017-02-26  1020754.0    74.0
2017-03-05        NaN     NaN
2017-03-12        NaN     NaN
2017-03-19  7151408.0  2526.0

Show keys for resampler groups:

for key in sorted(by_week.groups.keys(), reverse=True):
    print key

2017-03-19 00:00:00
2017-03-12 00:00:00
2017-03-05 00:00:00
2017-02-26 00:00:00

Try to do something with each group's dataframe. The first week is fine, but the second fails with a KeyError. Why is the keys() method returning an invalid key?

for key in sorted(by_week.groups.keys(), reverse=True):
    df = by_week.get_group(key)
    print df.head()

                              ID  DATA
DATETIME                              
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
2017-03-19 01:02:23.554  1021633   373


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-a57723281f49> in <module>()
      1 for key in sorted(by_week.groups.keys(), reverse=True):
----> 2     df = by_week.get_group(key)
      3     print df.head()

//anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in get_group(self, name, obj)
    585         inds = self._get_index(name)
    586         if not len(inds):
--> 587             raise KeyError(name)
    588 
    589         return obj.take(inds, axis=self.axis, convert=False)

KeyError: Timestamp('2017-03-12 00:00:00', offset='W-SUN')

My workaround is below. I'd also appreciate feedback on whether there is a more appropriate way to handle this. It skips the middle two weeks without data. Is there a fundamentally better way to iterate over each week's data?

for key in sorted(by_week.groups.keys(), reverse=True):
    try:
        df = by_week.get_group(key)
    except KeyError:
        # get_group raises KeyError for weeks with no rows; skip them
        continue
    print df.head()

                              ID  DATA
DATETIME                              
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
2017-03-19 01:02:23.554  1021633   373
                              ID  DATA
DATETIME                              
2017-02-21 13:42:01.133  1020754    74
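
A possible alternative to the try/except, sketched under assumptions: since `groups` lists every weekly bin, including empty ones, the keys can be filtered by bin size before calling `get_group`. The small frame below is a hypothetical stand-in for debug.csv, and the sketch is written for a recent pandas on Python 3; it has not been checked against 0.18.1.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv: one row in February, two in mid-March.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-18 22:41:10",
    "2017-03-18 23:45:18",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021626, 1021627], "DATA": [74, 384, 375]},
    index=index,
)
df.index.name = "DATETIME"

by_week = df.resample("W-SUN")

# size() gives a row count for every weekly bin, empty bins included.
sizes = by_week.size()

# Keep only the keys whose bin actually holds rows.
non_empty_keys = sizes[sizes > 0].index

# Newest week first, as in the loops above.
for key in sorted(non_empty_keys, reverse=True):
    week_df = by_week.get_group(key)
    print(key.date(), len(week_df))
```

This keeps the explicit key lookup while narrowing it to keys that are known to be valid, so a genuine KeyError elsewhere is not silently swallowed.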

Edit/Update: To address the response below about using the built-in iterator: my original code did use the built-in iterator, but I got the following error.

import pandas as pd
df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)
by_week = df.resample('W-SUN')

for key, df in by_week:
    print df.head()

gives:

Traceback (most recent call last):
  File "debug_sampler.py", line 10, in <module>
    for key, df in by_week:
  File "<redacted path>/pandas/core/groupby.py", line 600, in __iter__
    return self.grouper.get_iterator(self.obj, axis=self.axis)
AttributeError: 'NoneType' object has no attribute 'get_iterator'

Interestingly, if I use groupby instead, it's fine. But I hate to give up the convenience of the resample method (e.g. resampling by week ending on an arbitrary day).

import pandas as pd
df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)

by_week_groupby = df.groupby(lambda x: x.week)

for key, df in by_week_groupby:
    print df.head()

gives:

                              ID  DATA
DATETIME                              
2017-02-21 13:42:01.133  1020754    74
                              ID  DATA
DATETIME                              
2017-03-19 17:01:01.352  1021625   428
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
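
The groupby route does not have to give up the anchored weekly frequency. As a sketch, assuming a reasonably recent pandas on Python 3 (not verified on 0.18.1), `pd.Grouper` accepts the same `'W-SUN'` alias that `resample` does. The data below is a hypothetical stand-in for debug.csv, and `week_sizes` is just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-18 22:41:10",
    "2017-03-18 23:45:18",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021626, 1021627], "DATA": [74, 384, 375]},
    index=index,
)
df.index.name = "DATETIME"

# Group on the DatetimeIndex by weeks ending on Sunday.
by_week_groupby = df.groupby(pd.Grouper(freq="W-SUN"))

week_sizes = {}
for key, week_df in by_week_groupby:
    # Guard against weeks with no rows, in case they are yielded as empty frames.
    if week_df.empty:
        continue
    week_sizes[key] = len(week_df)
    print(key.date())
    print(week_df.head())
```

Swapping the alias for another anchored one, such as `'W-WED'`, ends each week on that day instead. Unlike grouping on `x.week`, the keys here are week-ending timestamps, so weeks from different years are not merged.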

The installed version of pandas:

print pd.__version__
0.18.1

Upvotes: 3

Views: 1084

Answers (1)

piRSquared

Reputation: 294258

Don't force your own iteration through the groupby object when pandas already has one (though it's not obvious):

for key, df in by_week:
    print(df.head())
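
For illustration only, here is a self-contained version of that loop. It assumes a recent pandas on Python 3, where a resampler can be iterated directly; the question's edit reports that this raised an AttributeError on 0.18.1. The frame is a hypothetical stand-in for debug.csv, and `weekly_rows` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-18 22:41:10",
    "2017-03-18 23:45:18",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021626, 1021627], "DATA": [74, 384, 375]},
    index=index,
)
df.index.name = "DATETIME"

weekly_rows = []
# Materialize the (key, frame) pairs so they can be walked newest week first.
for key, week_df in reversed(list(df.resample("W-SUN"))):
    # Skip weeks that contain no rows.
    if week_df.empty:
        continue
    weekly_rows.append((key, len(week_df)))
    print(week_df.head())
```

The `empty` check stands in for the try/except in the question: weeks without data are skipped without any key lookup.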

Upvotes: 1
