mikeL
mikeL

Reputation: 1114

DataFrame groupby on sequence of values not exceeding threshold

I have a data frame with two columns.

data=DataFrame({'a':[1,2,1,4,1,1,3,1,4,1,1,1],'b':[5,2,8,3,10,3,5,15,45,41,23,9]}) 

    a   b
0   1   5
1   2   2
2   1   8
3   4   3
4   1   10
5   1   3
6   3   5
7   1   15
8   4   45
9   1   41
10  1   23
11  1   9

I want to partition the DataFrame into groups based on the value of Column 'a' exceeding a given threshold. in this case the threshold is 3, so I want a new group whenever column 'a' exceeds the value of 3.

I want my groups to look like this

    a   b
0   1   5
1   2   2
2   1   8


    a   b
0   4   3
1   1   10
2   1   3
3   3   5
4   1   15

    a   b
0   4   45
1   1   41
2   1   23
3   1   9

The only thing I can think of is applying cummax

data['max']=data.b.cummax()

but his will only works in limited situations where the values greater than 3 are always getting bigger.

Upvotes: 1

Views: 553

Answers (3)

Gustav Rasmussen
Gustav Rasmussen

Reputation: 3961

Example:

I:

a = np.array([[1,2,3], [4,5,6]])
print(a)

O:

array([[1, 2, 3],
   [4, 5, 6]])

I:

np.cumsum(a)

O:

array([ 1,  3,  6, 10, 15, 21])

Upvotes: 0

James
James

Reputation: 36623

You can create a MultiIndex based on the values of a and apply it to the data frame. This will allow you to perform slicing operations on the data later.

There may be a faster way to do this, but you can create a boolean series of each time a is greater than a threshold (in this case 3), use cumsum to create an index for each group, count the occurrences in each group, and use that count to create an additional level to the index.

new_index = ((data.a>3).cumsum().value_counts(sort=False)
                       .apply(range).apply(pd.Series).stack().index)
data2 = data.set_index(new_index)
data2

# returns:
     a   b
0 0  1   5
  1  2   2
  2  1   8
1 0  4   3
  1  1  10
  2  1   3
  3  3   5
  4  1  15
2 0  4  45
  1  1  41
  2  1  23
  3  1   9

This allows you to access each of the desired groups by calling the first index.

data2.loc[0]

# returns:
   a  b
0  1  5
1  2  2
2  1  8

Upvotes: 0

akuiper
akuiper

Reputation: 214957

You can use (data.a > 3).cumsum; This creates the desired group variable you need:

threshold = 3
for k, g in data.groupby((data.a > threshold).cumsum()):
    print(g, "\n")

#   a  b
#0  1  5
#1  2  2
#2  1  8 

#   a   b
#3  4   3
#4  1  10
#5  1   3
#6  3   5
#7  1  15 

#    a   b
#8   4  45
#9   1  41
#10  1  23
#11  1   9 

Upvotes: 2

Related Questions