Reputation: 1114

DataFrame groupby on sequence of values not exceeding threshold

I have a data frame with two columns.

data=DataFrame({'a':[1,2,1,4,1,1,3,1,4,1,1,1],'b':[5,2,8,3,10,3,5,15,45,41,23,9]}) 

    a   b
0   1   5
1   2   2
2   1   8
3   4   3
4   1   10
5   1   3
6   3   5
7   1   15
8   4   45
9   1   41
10  1   23
11  1   9

I want to partition the DataFrame into groups based on the value of Column 'a' exceeding a given threshold. in this case the threshold is 3, so I want a new group whenever column 'a' exceeds the value of 3.

I want my groups to look like this

The only thing I can think of is applying cummax

data['max']=data.b.cummax()

but his will only works in limited situations where the values greater than 3 are always getting bigger.

Upvotes: 1

Answers (3)

Gustav Rasmussen

Reputation: 3961

Example:

a = np.array([[1,2,3], [4,5,6]])
print(a)

array([[1, 2, 3],
   [4, 5, 6]])

np.cumsum(a)

array([ 1,  3,  6, 10, 15, 21])

Upvotes: 0

James

Reputation: 36623

You can create a MultiIndex based on the values of a and apply it to the data frame. This will allow you to perform slicing operations on the data later.

There may be a faster way to do this, but you can create a boolean series of each time a is greater than a threshold (in this case 3), use cumsum to create an index for each group, count the occurrences in each group, and use that count to create an additional level to the index.

new_index = ((data.a>3).cumsum().value_counts(sort=False)
                       .apply(range).apply(pd.Series).stack().index)
data2 = data.set_index(new_index)
data2

# returns:
     a   b
0 0  1   5
  1  2   2
  2  1   8
1 0  4   3
  1  1  10
  2  1   3
  3  3   5
  4  1  15
2 0  4  45
  1  1  41
  2  1  23
  3  1   9

This allows you to access each of the desired groups by calling the first index.

data2.loc[0]

# returns:
   a  b
0  1  5
1  2  2
2  1  8

Upvotes: 0

akuiper

Reputation: 214957

You can use (data.a > 3).cumsum; This creates the desired group variable you need:

threshold = 3
for k, g in data.groupby((data.a > threshold).cumsum()):
    print(g, "\n")

#   a  b
#0  1  5
#1  2  2
#2  1  8 

#   a   b
#3  4   3
#4  1  10
#5  1   3
#6  3   5
#7  1  15 

#    a   b
#8   4  45
#9   1  41
#10  1  23
#11  1   9

Upvotes: 2

DataFrame groupby on sequence of values not exceeding threshold

Answers (3)

Related Questions