Reputation: 5559
I have a pandas dataframe that looks like this:
df =pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],[0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],[13,14,0,'C','C',16,18]],columns=['Start Sample','End Sample','Value','Start Name','End Name','Start Time','End Time'])
df
Out[18]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          21      1          A        A           8         9
2             0          13      1          B        B          11        13
3             0          12      1          C        C          14        15
4            13          14      0          C        C          16        18
I would like to group consecutive rows having the same Value whenever the difference between the Start Time of row i+1 and the End Time of row i is < 3.
For example, rows 1, 2 and 3 are consecutive rows having the same Value:
df['Start Time'].iloc[2] - df['End Time'].iloc[1] = 2
df['Start Time'].iloc[3] - df['End Time'].iloc[2] = 1
So they all should be merged. I would like that these rows become:
df2
Out[25]:
   Start Sample  End Sample  Value Start Name End Name  Start Time  End Time
0             0          10      0          A        A           6         7
1            11          12      1          A        C           8        15
2            13          14      0          C        C          16        18
Please note that the new merged row should have:
1) Start Sample equal to the Start Sample of the first merged row
2) End Sample equal to the End Sample of the last merged row
3) Value equal to the common Value
4) Start Name equal to the Start Name of the first merged row
5) End Name equal to the End Name of the last merged row
6) Start Time equal to the Start Time of the first merged row
7) End Time equal to the End Time of the last merged row
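For reference, the merge condition can be evaluated directly on the sample dataframe. This is a small sketch; the names gap, same_val and mergeable are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

# gap between each row's Start Time and the previous row's End Time
gap = df['Start Time'] - df['End Time'].shift(1)
# does this row share its Value with the previous row?
same_val = df['Value'] == df['Value'].shift(1)
# rows that should be merged into the row above them
mergeable = same_val & (gap < 3)
print(mergeable.tolist())  # → [False, False, True, True, False]
```

Rows 2 and 3 are flagged, which is why rows 1, 2 and 3 collapse into a single row.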
Upvotes: 1
Views: 841
Reputation: 5877
First, some code for you to consider, then some explanation. The approach here is to break the dataframe into subsets based on your "Value" column and work on those sub-dataframes.
def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subsets = [subset.apply(agg) for _, subset in
           df.groupby((df['Value'] != df['Value'].shift(1)).cumsum())]
pd.concat(subsets, axis=1).T
The "tricky" part is (df['Value'] != df['Value'].shift(1)).cumsum(). The comparison finds where "Value" changes; we will groupby that, but first the cumsum() turns those change points into unique group labels.
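To see what that expression produces on the sample dataframe, here is a quick sketch:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

# True wherever Value differs from the previous row;
# cumsum turns those change points into a running group id
labels = (df['Value'] != df['Value'].shift(1)).cumsum()
print(labels.tolist())  # → [1, 2, 2, 2, 3]
```

Rows 1-3 all get label 2, so they land in the same group.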
After the groupby, you iterate through the sub-dataframes you are interested in. From here you can do a great many things, which is why this approach is flexible.
For each subset, apply calls the function on each series (column). In your case you want one of two values depending on the column name, so a single function (agg here) can be applied to every series.
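For instance, applying agg to the middle subset (the three Value == 1 rows) picks the Start fields from the first row and everything else from the last row. A sketch, selecting that subset by hand:

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

def agg(series):
    # the column name decides whether we keep the first or last value
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

subset = df.iloc[1:4]        # the three consecutive Value == 1 rows
merged = subset.apply(agg)   # one Series per column -> one merged row
print(merged['Start Name'], merged['End Name'])  # → A C
print(merged['Start Time'], merged['End Time'])  # → 8 15
```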
Edit: the test for a change above only included one of the two criteria the OP specified. Including both is easy enough, but it extends the logic, so it should be broken out a little; I was already pushing the bounds of an unreasonable one-liner as it was. The groupby condition should be:
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
df.groupby((val_chg | time_chg).cumsum())
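Putting both criteria together with the agg function from above gives the full pipeline. A sketch (on this sample data the time criterion happens not to split any group, so the result matches the expected df2; reset_index is added only to tidy the row labels):

```python
import pandas as pd

df = pd.DataFrame([[0,10,0,'A','A',6,7],[11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],[0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

def agg(series):
    if series.name.startswith('Start'):
        return series.iloc[0]
    return series.iloc[-1]

# a new group starts when Value changes OR the time gap is >= 3
val_chg = df['Value'] != df['Value'].shift(1)
time_chg = df['Start Time'] - df['End Time'].shift(1) >= 3
groups = df.groupby((val_chg | time_chg).cumsum())

subsets = [subset.apply(agg) for _, subset in groups]
result = pd.concat(subsets, axis=1).T.reset_index(drop=True)
# result rows: (0,10,0,A,A,6,7), (11,12,1,A,C,8,15), (13,14,0,C,C,16,18)
```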
Upvotes: 2
Reputation: 101
There are probably better ways to do it, but here is an iterrows() approach:
df = pd.DataFrame([[0,10,0,'A','A',6,7],
                   [11,21,1,'A','A',8,9],
                   [0,13,1,'B','B',11,13],
                   [0,12,1,'C','C',14,15],
                   [13,14,0,'C','C',16,18]],
                  columns=['Start Sample','End Sample','Value',
                           'Start Name','End Name','Start Time','End Time'])

df['keep'] = ''
active_row = None
for i, row in df.iterrows():
    if active_row is None:
        # first row always starts a group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    if row['Value'] != df.loc[active_row, 'Value']:
        # different Value: this row starts a new group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    elif row['Start Time'] - df.loc[active_row, 'End Time'] >= 3:
        # same Value but the time gap is too large: new group
        active_row = i
        df.loc[i, 'keep'] = 1
        continue
    # otherwise fold this row into the active row
    df.loc[active_row, 'End Time'] = row['End Time']
    df.loc[active_row, 'End Sample'] = row['End Sample']
    df.loc[active_row, 'End Name'] = row['End Name']
    df.loc[i, 'keep'] = 0

final_df = df[df.keep == 1].drop('keep', axis=1)
It iterates through the rows, remembering the last meaningful row and updating it during the loop. Each pass classifies a row as keep (1) or drop (0), and we use that flag to filter the merged-away rows out at the end.
Upvotes: 0