Reputation: 3035
So I have the following DataFrame df:
The frame contains two groups of data that are sorted within that group.
Group 1 is from index 359 to 365 inclusive
Group 2 is from index 366 to 371 inclusive
I want to separate them into the two groups. There may be more than two groups. The logic I am applying is whenever the next STEPS_ID is less than the current STEPS_ID, this marks the end of the group.
I am easily able to get this pointer by df.STEPS_ID <= df.STEPS_ID.shift(-1)
Is there an elegant pandas way to achieve this easily possibly using vectorized operations rather than for loop?
This seems to be a common enough problem that I am sure there must be a well-defined algorithm to solve these kinds of problems. I would also appreciate if you guys could guide me in reading up on the theoretical basis for such algorithms.
Upvotes: 1
Views: 2008
Reputation: 879401
There is more than one way to "separate things into groups". One way would be to make a list of groups. But that is not the ideal way when dealing with a Pandas DataFrame. Once you have a list, you are forced to loop over the list in a Python loop. Those are comparatively slow compared to native Pandas operations.
Assuming you have enough memory, a better way would be to add an column or index to the DataFrame:
import pandas as pd
df = pd.DataFrame({'STEPS_ID':range(1107,1113)*2})
df['GROUP'] = (df['STEPS_ID'] < df['STEPS_ID'].shift(1)).astype('int').cumsum()
# df.set_index('GROUP', inplace=True, append=True)
print(df)
yields
STEPS_ID GROUP
0 1107 0
1 1108 0
2 1109 0
3 1110 0
4 1111 0
5 1112 0
6 1107 1
7 1108 1
8 1109 1
9 1110 1
10 1111 1
11 1112 1
Now you can do aggregation/transformation operations on each group by calling
df.groupby('GROUP')....
Upvotes: 4