moustachio

Reputation: 3004

Indexing by row counts in a pandas dataframe

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.

My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.

In other words, the first row of my result should be the average, across all items, of each item's first observation; the second result row should be the average of each item's second observation; and so on.

Stated another way: if we took each item's date-ordered rows and indexed them i = 1, 2, ..., n, I need the average across all items of the values at each position i: the average of every item's first observation, the average of every item's second observation, and so on.

How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate over the rows for each item and keep a running average? That would work, but it doesn't leverage the power of pandas at all.


Adding some example data:

  item_id  date          X       DUMMY_ROWS
  20       2010-11-01    16759   0  
           2010-12-01    16961   1
           2011-01-01    17126   2
           2011-02-01    17255   3
           2011-03-01    17400   4
           2011-04-01    17551   5
  21       2007-09-01        4   6
           2007-10-01        5   7
           2007-11-01        6   8
           2007-12-01       10   9
  22       2006-05-01       10   10
           2006-07-01       13   11
  23       2006-05-01        2   12
  24       2008-01-01        2   13
           2008-02-01        9   14
           2008-03-01       18   15
           2008-04-01       19   16
           2008-05-01       23   17
           2008-06-01       32   18

I've added a dummy rows column that does not exist in the data, purely for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
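For reference, here's a minimal sketch (not part of the original post; values copied from the table above, and the explanatory DUMMY_ROWS column omitted) of how the example frame could be built with an (item_id, date) MultiIndex:

import pandas as pd

# Hypothetical reconstruction of the example data shown above
item_ids = [20]*6 + [21]*4 + [22]*2 + [23] + [24]*6
dates = pd.to_datetime([
    '2010-11-01', '2010-12-01', '2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01',
    '2007-09-01', '2007-10-01', '2007-11-01', '2007-12-01',
    '2006-05-01', '2006-07-01',
    '2006-05-01',
    '2008-01-01', '2008-02-01', '2008-03-01', '2008-04-01', '2008-05-01', '2008-06-01'])
X = [16759, 16961, 17126, 17255, 17400, 17551, 4, 5, 6, 10, 10, 13, 2, 2, 9, 18, 19, 23, 32]

df = pd.DataFrame({'item_id': item_ids, 'date': dates, 'X': X}).set_index(['item_id', 'date'])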

Upvotes: 4

Views: 2637

Answers (2)

moustachio

Reputation: 3004

Here's an alternative method for this that I finally figured out (it assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:

def sequence_id(item):
    item['seq'] = range(0,len(item),1)
    return item

dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)

Testing this on a 10,000 row subset of the data frame:

%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop

It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:

dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))

The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, but it's almost 4 times faster (I can't compare the methods on the full 13-million-row dataframe, as the first method was resulting in memory errors):

%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop

Calculating the average as in my original question is only slightly different. The original method was:

dfWithSeqID_old.groupby('seq').agg(np.mean).head()

But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:

dfWithSeqID_new.mean(level=1).head()

The result is the same.
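As a quick sanity check (a sketch only, assuming dfWithSeqID_old and dfWithSeqID_new were both built from the same df as above), the per-position means from the two approaches can be compared directly; groupby(level=1) is just the equivalent of mean(level=1) written out explicitly:

import numpy as np

old_means = dfWithSeqID_old.groupby('seq').X.mean()
new_means = dfWithSeqID_new.X.groupby(level=1).mean()
print(np.allclose(old_means.values, new_means.values))  # True if the two approaches agree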

Upvotes: 0

cwharland

Reputation: 6713

One option is to reset the index then group by id.

df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean) 

This leaves your original df intact and gets you the mean across all months for each item id.

For your updated question (great example, by the way), I think the approach would be to add an "item_sequence_id"; I've done this in the past with similar data.

df.sort(['item_id', 'date'], inplace = True)

def sequence_id(item):
    # label each item's rows 0..n-1 in date order
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)

The idea here is that the seq_id allows you to identify the position of the data point in time per item_id; assigning non-unique seq_id values across items is what lets you group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users, regardless of their absolute time and user id.

Hopefully this is more of what you want.
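For concreteness, here is roughly what the first couple of positional means of X would look like on the example data from your question (the numbers are worked out by hand from the table, so treat them as illustrative only):

df_with_seq_id.groupby(['seq_id']).X.mean().head()
# seq_id
# 0    3355.4    # (16759 + 4 + 10 + 2 + 2) / 5
# 1    4247.0    # (16961 + 5 + 13 + 9) / 4
# ...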

Upvotes: 3
