Reputation: 5559
I have a pandas dataframe that looks like this:
df =pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df
Out[18]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          21      1          A        A           8         9
2             0          13      1          B        B          11        13
3             0          12      1          C        C          14        15
4            13          14      0          C        C          16        18
I would like to group consecutive rows having the same Value whenever the difference between the Start Time of row i+1 and the End Time of row i is < 3.
For example, rows 1, 2 and 3 are consecutive rows having the same Value:
df['Start Time'].iloc[2] - df['End Time'].iloc[1] = 2
df['Start Time'].iloc[3] - df['End Time'].iloc[2] = 1
So they all should be merged. I would like that these rows become:
df2
Out[25]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          12      1          A        C           8        15
2            13          14      0          C        C          16        18
Please note that the new merged row should have:
1) Start Sample equal to the Start Sample of the first merged row
2) End Sample equal to the End Sample of the last merged row
3) Value equal to the common Value
4) Start Name equal to the Start Name of the first merged row
5) End Name equal to the End Name of the last merged row
6) Start Time equal to the Start Time of the first merged row
7) End Time equal to the End Time of the last merged row
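For reference, the merge condition can be evaluated directly on the sample dataframe. This is a small sketch; the names gap, same_val and mergeable are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

# gap between each row's Start Time and the previous row's End Time
gap = df['Start Time'] - df['End Time'].shift(1)
# does this row share its Value with the previous row?
same_val = df['Value'] == df['Value'].shift(1)
# rows that should be merged into the row above them
mergeable = same_val & (gap < 3)
print(mergeable.tolist())  # → [False, False, True, True, False]
```

Rows 2 and 3 are flagged, which is why rows 1, 2 and 3 collapse into a single row.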
Upvotes: 1
Views: 841
Reputation: 5877
First, some code for you to consider, then some explanation. The approach here is to break the dataframe into subsets based on your "Value" column and work on those sub-dataframes.
def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subsets = [subset.apply(agg) for _, subset in
           df.groupby((df['Value'] != df['Value'].shift(1)).cumsum())]
pd.concat(subsets, axis=1).T
The "tricky" part is (df['Value'] != df['Value'].shift(1)).cumsum(). The comparison finds where "Value" changes; we will groupby that, but first the cumsum() turns those change points into unique group labels.
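To see what that expression produces on the sample dataframe, here is a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

# True wherever Value differs from the previous row;
# cumsum turns those change points into a running group id
labels = (df['Value'] != df['Value'].shift(1)).cumsum()
print(labels.tolist())  # → [1, 2, 2, 2, 3]
```

Rows 1-3 all get label 2, so they land in the same group.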
After the groupby, you iterate through the sub-dataframes you are interested in. From here you can do a great many things, which is why this approach is flexible.
For each subset, apply calls the function on each series (column). In your case you want one of two values depending on the column name, so a single function (agg here) can be applied to every series.
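For instance, applying agg to the middle subset (the three Value == 1 rows) picks the Start fields from the first row and everything else from the last row. A sketch, selecting that subset by hand:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

def agg(series):
    # the column name decides whether we keep the first or last value
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subset = df.iloc[1:4]        # the three consecutive Value == 1 rows
merged = subset.apply(agg)   # one Series per column -> one merged row
print(merged['Start Name'], merged['End Name'])  # → A C
print(merged['Start Time'], merged['End Time'])  # → 8 15
```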
Edit: the test for a change above only included one of the two criteria the OP specified. Including both is easy enough, but it extends the logic, so it should be broken out a little; I was already pushing the bounds of an unreasonable one-liner as it was. The groupby condition should be:
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
df.groupby((val_chg | time_chg).cumsum())
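Putting both criteria together with the agg function from above gives the full pipeline. A sketch (on this sample data the time criterion happens not to split any group, so the result matches the expected df2; reset_index is added only to tidy the row labels):

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

# a new group starts when Value changes OR the time gap is >= 3
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
groups = df.groupby((val_chg | time_chg).cumsum())

subsets = [subset.apply(agg) for _, subset in groups]
result = pd.concat(subsets, axis=1).T.reset_index(drop=True)
# result rows: (0,10,0,A,A,6,7), (11,12,1,A,C,8,15), (13,14,0,C,C,16,18)
```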
Upvotes: 2
Reputation: 101
There are probably better ways to do it, but here is an iterrows() approach:
df = pd.DataFrame([[0,10,0,'A','A',6,7],
                   [11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],
                   [0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

df['keep'] = ''
active_row = None
for i, row in df.iterrows():
    if active_row is None:
        # first row always starts a group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    if row['Value'] != df.loc[active_row, 'Value']:
        # different Value: this row starts a new group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    elif row['Start Time'] - df.loc[active_row, 'End Time'] >= 3:
        # same Value but the time gap is too large: new group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    # otherwise fold this row into the active row
    df.loc[active_row, 'End Time'] = row['End Time']
    df.loc[active_row, 'End Sample'] = row['End Sample']
    df.loc[active_row, 'End Name'] = row['End Name']
    df.loc[i, 'keep'] = 0

final_df = df[df.keep == 1].drop('keep', axis=1)
It iterates through the rows, remembering the last meaningful row and updating it during the loop. Each pass classifies a row as keep (1) or drop (0), and we use that flag to filter the merged-away rows out at the end.
Upvotes: 0