Emma
Emma

Reputation: 1297

Slicing a pandas dataframe by position on level 2 of a MultIindex

I have a pandas DataFrame with a MultiIndex as follows:

>>> import pandas as pd
>>> category = ['bar', 'bar', 'bar', 'bar', 'bar', 'baz', 'baz', 'baz', 'baz',
                'baz', 'baz', 'foo', 'foo', 'foo']
>>> timestamp = ['2017-01-01 09:00:00', '2017-01-01 09:01:00', '2017-01-01 09:02:00', 
                '2017-01-01 09:03:00', '2017-01-01 09:04:00', '2016-11-18 03:18:00', 
                '2016-11-18 03:19:00', '2016-11-18 03:20:00', '2016-11-18 03:21:00', 
                '2016-11-18 03:22:00', '2016-11-18 03:23:00', '2017-02-03 20:39:00', 
                '2017-02-03 20:40:00', '2017-02-03 20:41:00']
>>> values = [1,1,2,2,2,35,3,3,4,4,4,28,28,28]
>>> tuples = list(zip(*[category,timestamp]))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['category', 'timestamp'])
>>> df = pd.DataFrame(values,index=index,columns=['values'])
>>> df
                                  values
category timestamp                  
bar      2017-01-01 09:00:00       1
         2017-01-01 09:01:00       1
         2017-01-01 09:02:00       2
         2017-01-01 09:03:00       2
         2017-01-01 09:04:00       2
baz      2016-11-18 03:18:00      35
         2016-11-18 03:19:00       3
         2016-11-18 03:20:00       3
         2016-11-18 03:21:00       4
         2016-11-18 03:22:00       4
         2016-11-18 03:23:00       4
foo      2017-02-03 20:39:00      28
         2017-02-03 20:40:00      28
         2017-02-03 20:41:00      28

For each category, I want to find the cumulative sum of the number of times the value column changes, like this:

                              values changed  cum_changes
category timestamp                                       
bar      2017-01-01 09:00:00       1   False            0
         2017-01-01 09:01:00       1   False            0
         2017-01-01 09:02:00       2    True            1
         2017-01-01 09:03:00       2   False            1
         2017-01-01 09:04:00       2   False            1
baz      2016-11-18 03:18:00      35   False            0
         2016-11-18 03:19:00       3    True            1
         2016-11-18 03:20:00       3   False            1
         2016-11-18 03:21:00       4    True            2
         2016-11-18 03:22:00       4   False            2
         2016-11-18 03:23:00       4   False            2
foo      2017-02-03 20:39:00      28   False            0
         2017-02-03 20:40:00      28   False            0
         2017-02-03 20:41:00      28   False            0

I tried doing this:

df["changes"] = False
df.iloc[idx[:,1:],1] = df.iloc[idx[:,1:],0] == df.iloc[idx[:,:-1],0]   #This doesn't work
df["cum_changes"] = df["changed"].groupby(level=[0]).cumsum().astype(int)

But unfortunately the second line doesn't work. It's analogous to the way you would multi-index by value using loc, but apparently iloc doesn't handle a MultiIndex the same way. I can't index by label because the timestamps are different in each group, and I can't use head() because the length of each group is different. Is it possible to do positional indexing on the second level of a MultiIndex?

What I actually need is the "cum_changes" column, the "changed" column is just an intermediate step. If there's a different way to calculate the "cum_changes" column I'm interested to hear it. I know it can be done by iterating over the category column, but it seems like it should be possible to keep this vectorized, so I'm looking for a solution that does not involve looping.

I found this related question, but I don't believe it applies since the solution isn't actually indexing by position but rather finding the labels that correspond to the given positions and indexing by the labels: Slice MultiIndex pandas DataFrame by position

Upvotes: 1

Views: 280

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

you can use diff() as @Psidom has already said in the comment:

In [25]: df['x'] = df.groupby(level=0)['values'] \
                     .apply(lambda x: x.diff().fillna(0).ne(0).cumsum())

In [26]: df
Out[26]:
                              values  x
category timestamp
bar      2017-01-01 09:00:00       1  0
         2017-01-01 09:01:00       1  0
         2017-01-01 09:02:00       2  1
         2017-01-01 09:03:00       2  1
         2017-01-01 09:04:00       2  1
baz      2016-11-18 03:18:00      35  0
         2016-11-18 03:19:00       3  1
         2016-11-18 03:20:00       3  1
         2016-11-18 03:21:00       4  2
         2016-11-18 03:22:00       4  2
         2016-11-18 03:23:00       4  2
foo      2017-02-03 20:39:00      28  0
         2017-02-03 20:40:00      28  0
         2017-02-03 20:41:00      28  0

Upvotes: 1

Related Questions