user1739581

Reputation: 85

Pandas: find groups by index if consecutively numbered

I am trying to build a list of tuples holding the start and end values (i.e. rows) of each group in a dataframe df2, where the groups are determined by the first (zeroth) column, df2[0]. Example df2:

COL0  COL1 COL2
  4    x    y    # start 'tuple x' of COL1
  5    i    j
  6    n    m    # end 'tuple n'
 14    f    a    # start 'tuple f'
 15    e    b    # end 'tuple e'
 ...

So consecutive COL0 values form a group. If the next row is not consecutive (e.g. 6 to 14), a new group starts. A selection criterion could be the following:

Crit_a = df2[0][0] + 1 == df2[0][1]

As output, I am looking for a new df3 with per row the following:

COL0  COL1 COL2 COL3 COL4 ...
  4    x    y    n    m   # start values and end values of COL1 and COL2
 14    f    a    e    b

I was looking at SO here and other locations. Thank you for your suggestions.

Upvotes: 1

Views: 406

Answers (2)

Alexander

Reputation: 109626

Not exactly your desired output, but perhaps more intuitive?

I create a column named group_no to label the runs of consecutive values in COL0. I difference the column, locate the values where this difference is not one, and then take a cumsum of the result. The first element is ambiguous (it is NaN when differenced), so I check whether its value plus one equals the second value. If so, the first value is continuous and is assigned a value of 1. If not, it is not continuous and is assigned a value of 0.

df = df.assign(group_no = (df.COL0.diff() != 1).cumsum())
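# Assign through df itself; the chained form df.group_no.iat[0] = ... may not write back.
df.iat[0, df.columns.get_loc('group_no')] = 1 if df.COL0.iat[0] + 1 == df.COL0.iat[1] else 0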
df_new = df.groupby('group_no').agg(
    {'COL0': ['first'], 
     'COL1': ['first', 'last'], 
     'COL2': ['first', 'last']})
>>> df_new
          COL2       COL0  COL1     
         first last first first last
group_no                            
1            y    m     4     x    n
2            a    b    14     f    e

The agg function takes a dictionary, so on Python versions where dictionaries do not preserve insertion order (before 3.7) the resulting order of the columns can be arbitrary. To order the resulting columns, you can select them explicitly, e.g.:

df_new[[('COL0', 'first'),
        ('COL1', 'first'),
        ('COL1', 'last'),
        ('COL2', 'first'),
        ('COL2', 'last')]]

This may also work:

n = 3  # First three columns of original dataframe.
df_new.loc[:, pd.IndexSlice[df.columns[:n], :]]  
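
If the flat COL0 ... COL4 layout from the question is preferred, one possible variation is sketched below. It assumes pandas 0.25 or later (for named aggregation) and rebuilds the question's sample rows by hand. The frame name df3 follows the question, and the output names COL3 and COL4 are taken from the asker's desired output, not from the code above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

# A new run starts wherever the step from the previous COL0 value is not 1.
# The first row's diff is NaN, which also counts as "not 1".
group_no = df["COL0"].diff().ne(1).cumsum().rename("group_no")

# Named aggregation gives flat column names in the order they are written.
df3 = (
    df.groupby(group_no)
      .agg(
          COL0=("COL0", "first"),
          COL1=("COL1", "first"),
          COL2=("COL2", "first"),
          COL3=("COL1", "last"),
          COL4=("COL2", "last"),
      )
      .reset_index(drop=True)
)
print(df3)
```

Because the output names are spelled out in the agg call, no reordering step is needed afterwards.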

Upvotes: 1

Allen Qin

Reputation: 19957

First create a group key for rows belonging to the same group.

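import numpy as np
import pandas as pd

# Keep COL0 where a new run starts, NaN elsewhere, then forward-fill the run start.
# raw=True passes NumPy arrays to the lambda, so x[-1] and x[0] are positional.
df['COL0'] = \
    (df.COL0.rolling(2,min_periods=1)
        .apply(lambda x: x[-1] if int(x[-1]-x[0])!=1 else np.nan, raw=True)
        .ffill()
    )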
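
To see what this key looks like, here is a small sketch on the COL0 values from the question. The standalone Series col0 and the name key are only for illustration, and raw=True is an assumption about newer pandas versions, where rolling apply would otherwise hand the lambda a Series instead of an array.

```python
import numpy as np
import pandas as pd

# COL0 values copied from the question's example.
col0 = pd.Series([4, 5, 6, 14, 15], name="COL0")

# Each window holds the previous and current value (only the current one
# for the first row). A step other than 1 marks the start of a new run.
key = (
    col0.rolling(2, min_periods=1)
        .apply(lambda w: w[-1] if w[-1] - w[0] != 1 else np.nan, raw=True)
        .ffill()
)
print(key.tolist())  # [4.0, 4.0, 4.0, 14.0, 14.0]
```

Every row ends up carrying the first COL0 value of its run, which is why the grouped result is indexed by 4.0 and 14.0.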

Then group by the key and find the start and end row and expand them as columns. Finally rename the columns.

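# For each group, take the first and last row of COL1/COL2 and flatten them into one row.
df2 = \
    (df[['COL1','COL2']].groupby(df.COL0)
        .apply(lambda x: np.array([x.iloc[0],x.iloc[-1]]).flatten())
        .apply(pd.Series)
        # rename relabels the columns 0..3 as COL1..COL4; rename_axis no longer accepts a function for this.
        .rename(columns=lambda x: 'COL'+str(x+1))
    )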


df2
Out[178]: 
     COL1 COL2 COL3 COL4
COL0                    
4.0     x    y    n    m
14.0    f    a    e    b

Upvotes: 1
