Pandas: find groups by index if consecutively numbered

Question

I am trying to build a list of tuples holding the start and end rows of each group in a dataframe df2, where the groups are determined by the index (the first, or zeroth, column df2[0]). Example df2:

COL0  COL1 COL2
  4    x    y    # start 'tuple x' of COL1
  5    i    j
  6    n    m    # end 'tuple n'
 14    f    a    # start 'tuple f'
 15    e    b    # end 'tuple e'
 ...

Consecutive COL0 values form a group. If the next row is not consecutive (e.g. 6 followed by 14), a new group starts. A selection criterion could be the following:

Crit_a = df2[0][0] + 1 == df2[0][1]

As output, I am looking for a new df3 with one row per group, like the following:

COL0  COL1 COL2 COL3 COL4 ...
  4    x    y    n    m   # start values and end values of COL1 and COL2
 14    f    a    e    b

I have looked on SO and elsewhere. Thank you for your suggestions.

Alexander · Accepted Answer

Not exactly your desired output, but perhaps more intuitive?

I create a column named group_no to label the runs of consecutive values in COL0. I difference the column, flag the values where this difference is not one, and then take a cumsum of the result. The first element is ambiguous (it is NaN when differenced), so I check whether its value plus one equals the second value. If so, the first value is continuous and assigned a value of 1. If not, it is not continuous and assigned a value of 0.

df = df.assign(group_no = (df.COL0.diff() != 1).cumsum())
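# Assign through df itself; chained assignment via df.group_no may only modify a copy.
df.loc[df.index[0], 'group_no'] = 1 if df.COL0.iat[0] + 1 == df.COL0.iat[1] else 0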
df_new = df.groupby('group_no').agg(
    {'COL0': ['first'], 
     'COL1': ['first', 'last'], 
     'COL2': ['first', 'last']})
>>> df_new
          COL2       COL0  COL1     
         first last first first last
group_no                            
1            y    m     4     x    n
2            a    b    14     f    e
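
To see what the labelling step does on its own, here is a minimal sketch that rebuilds the sample data from the question and prints the intermediate values. It assumes COL0 is an ordinary integer column rather than the index; the names step and starts_run are only illustrative.

```python
import pandas as pd

# Sample data from the question (assumed to be regular columns, not the index).
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

step = df["COL0"].diff()        # distance to the previous row: NaN, 1, 1, 8, 1
starts_run = step != 1          # True wherever a new run of consecutive values begins
group_no = starts_run.cumsum()  # running count of run starts = group label

print(starts_run.tolist())  # [True, False, False, True, False]
print(group_no.tolist())    # [1, 1, 1, 2, 2]
```

Note that the NaN in the first position compares as not equal to 1, so the first row always opens a group before any adjustment of the first label is applied.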

The agg function takes a dictionary, so the resulting order of the columns can be arbitrary, as in the output above (dictionaries only guarantee insertion order from Python 3.7 onward). To order the resulting columns, you could do it explicitly, e.g.:

df_new[[('COL0', 'first'),
        ('COL1', 'first'),
        ('COL1', 'last'),
        ('COL2', 'first'),
        ('COL2', 'last')]]

This may also work:

n = 3  # First three columns of original dataframe.
df_new.loc[:, pd.IndexSlice[df.columns[:n], :]]
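
If the flat layout from the question is the goal, named aggregation (available in pandas 0.25 and later) may be a simpler route, since it fixes both the column names and their order. This is a sketch under the assumption that COL3 and COL4 in the question stand for the last values of COL1 and COL2; the sample frame is rebuilt here so the snippet runs by itself.

```python
import pandas as pd

# Sample data from the question (assumed to be regular columns, not the index).
df = pd.DataFrame({
    "COL0": [4, 5, 6, 14, 15],
    "COL1": ["x", "i", "n", "f", "e"],
    "COL2": ["y", "j", "m", "a", "b"],
})

# Label runs of consecutive COL0 values, as in the answer above.
group_no = (df["COL0"].diff() != 1).cumsum().rename("group_no")

# Named aggregation: output_name=(source column, aggregation).
# Keyword order determines the column order of the result.
df3 = (
    df.groupby(group_no)
      .agg(
          COL0=("COL0", "first"),
          COL1=("COL1", "first"),
          COL2=("COL2", "first"),
          COL3=("COL1", "last"),   # assumed meaning of COL3: end value of COL1
          COL4=("COL2", "last"),   # assumed meaning of COL4: end value of COL2
      )
      .reset_index(drop=True)
)

print(df3)
```

For the sample data this gives one row per group, (4, x, y, n, m) and (14, f, a, e, b), without a MultiIndex on the columns.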
