moustachio

Reputation: 3004

Indexing by row counts in a pandas dataframe

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.

My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.

In other words, the first row of my result should be the average, across all items, of each item's first observation; the second result row should be the average of each item's second observation; and so on.

Stated another way: if we took each item's date-ordered rows and indexed them i = 1, 2, ..., n, I need the average across all items of the values at each position i: the average of every item's first observation, the average of every item's second observation, and so on.

How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate over the rows for each item and keep a running average? That would work, but it doesn't leverage the power of pandas at all.


Adding some example data:

  item_id  date          X       DUMMY_ROWS
  20       2010-11-01    16759   0  
           2010-12-01    16961   1
           2011-01-01    17126   2
           2011-02-01    17255   3
           2011-03-01    17400   4
           2011-04-01    17551   5
  21       2007-09-01        4   6
           2007-10-01        5   7
           2007-11-01        6   8
           2007-12-01       10   9
  22       2006-05-01       10   10
           2006-07-01       13   11
  23       2006-05-01        2   12
  24       2008-01-01        2   13
           2008-02-01        9   14
           2008-03-01       18   15
           2008-04-01       19   16
           2008-05-01       23   17
           2008-06-01       32   18

I've added a dummy rows column that does not exist in the data, purely for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
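For reference, here's a minimal sketch (not part of the original post; values copied from the table above, and the explanatory DUMMY_ROWS column omitted) of how the example frame could be built with an (item_id, date) MultiIndex:

import pandas as pd

# Hypothetical reconstruction of the example data shown above
item_ids = [20]*6 + [21]*4 + [22]*2 + [23] + [24]*6
dates = pd.to_datetime([
    '2010-11-01', '2010-12-01', '2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01',
    '2007-09-01', '2007-10-01', '2007-11-01', '2007-12-01',
    '2006-05-01', '2006-07-01',
    '2006-05-01',
    '2008-01-01', '2008-02-01', '2008-03-01', '2008-04-01', '2008-05-01', '2008-06-01'])
X = [16759, 16961, 17126, 17255, 17400, 17551, 4, 5, 6, 10, 10, 13, 2, 2, 9, 18, 19, 23, 32]

df = pd.DataFrame({'item_id': item_ids, 'date': dates, 'X': X}).set_index(['item_id', 'date'])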

Upvotes: 4

Views: 2637

Answers (2)

moustachio

Reputation: 3004

Here's an alternative method for this that I finally figured out (it assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:

def sequence_id(item):
    item['seq'] = range(0,len(item),1)
    return item

dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)

Testing this on a 10,000 row subset of the data frame:

%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop

It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:

dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))

The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, but it's almost 4 times faster (I can't compare the methods on the full 13-million-row dataframe, as the first method was resulting in memory errors):

%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop

Calculating the average as in my original question is only slightly different. The original method was:

dfWithSeqID_old.groupby('seq').agg(np.mean).head()

But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:

dfWithSeqID_new.mean(level=1).head()

The result is the same.
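As a quick sanity check (a sketch only, assuming dfWithSeqID_old and dfWithSeqID_new were both built from the same df as above), the per-position means from the two approaches can be compared directly; groupby(level=1) is just the equivalent of mean(level=1) written out explicitly:

import numpy as np

old_means = dfWithSeqID_old.groupby('seq').X.mean()
new_means = dfWithSeqID_new.X.groupby(level=1).mean()
print(np.allclose(old_means.values, new_means.values))  # True if the two approaches agree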

Upvotes: 0

cwharland

Reputation: 6713

One option is to reset the index then group by id.

df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean) 

This leaves your original df intact and gets you the mean across all months for each item id.

For your updated question (great example, by the way), I think the approach would be to add an "item_sequence_id"; I've done this in the past with similar data.

df.sort(['item_id', 'date'], inplace = True)

def sequence_id(item):
    # label each item's rows 0..n-1 in date order
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)

The idea here is that the seq_id allows you to identify the position of the data point in time per item_id; assigning non-unique seq_id values across items is what lets you group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users, regardless of their absolute time and user id.

Hopefully this is more of what you want.
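For concreteness, here is roughly what the first couple of positional means of X would look like on the example data from your question (the numbers are worked out by hand from the table, so treat them as illustrative only):

df_with_seq_id.groupby(['seq_id']).X.mean().head()
# seq_id
# 0    3355.4    # (16759 + 4 + 10 + 2 + 2) / 5
# 1    4247.0    # (16961 + 5 + 13 + 9) / 4
# ...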

Upvotes: 3
