ResidentSleeper

Reputation: 2495

Pandas: Drop and count consecutive duplicates with condition

I want to collapse consecutive duplicate rows in column val where val equals 1, and replace val with the number of rows that were collapsed.

start should then come from the first row and end from the last row of each run of consecutive duplicates.

import pandas as pd

df = pd.DataFrame()
df['start'] = [1, 2, 3, 4, 5, 6, 18, 30, 31]
df['end'] = [2, 3, 4, 5, 6, 18, 30, 31, 32]
df['val'] = [1, 1, 1, 1, 1, 12, 12, 1, 1]

df

   start  end  val
0      1    2    1
1      2    3    1
2      3    4    1
3      4    5    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1

Expected Result

   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2

I tried:

df[~((df.val==1) & (df.val == df.val.shift(1)) & (df.val == df.val.shift(-1)))]

   start  end  val
0      1    2    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1

but I can't figure out how to get from there to my expected result. Any suggestions?

Upvotes: 2

Views: 559

Answers (3)

manwithfewneeds

Reputation: 1167

You could also do this as a two-liner, using a mask as the groupby key:

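# start a new group whenever val is not 1 or differs from the previous row
m = (df.val.ne(1) | df.val.ne(df.val.shift())).cumsum()
# 'sum' turns each run of 1s into its length; every other row is its own group and keeps its value
df = df.groupby(m).agg({'start': 'first', 'end': 'last', 'val': 'sum'}).reset_index(drop=True)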
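
To see what the grouping key looks like on the question's data, here is a minimal, self-contained sketch. It assumes the sample frame from the question and aggregates val with 'sum', so each run of 1s becomes its row count. The names key and out, and the rename to "run", are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 18, 30, 31],
    "end":   [2, 3, 4, 5, 6, 18, 30, 31, 32],
    "val":   [1, 1, 1, 1, 1, 12, 12, 1, 1],
})

# A new group starts whenever val is not 1, or val differs from the previous row.
# Renamed to "run" (an illustrative name) so the key does not share a name with a column.
key = (df["val"].ne(1) | df["val"].ne(df["val"].shift())).cumsum().rename("run")
print(key.tolist())  # [1, 1, 1, 1, 1, 2, 3, 4, 4]

# Runs of 1s share a key and are summed; every other row is a group of its own.
out = (
    df.groupby(key)
      .agg({"start": "first", "end": "last", "val": "sum"})
      .reset_index(drop=True)
)
print(out)
```

Because every row with val other than 1 gets its own key, the two 12 rows are left untouched rather than merged.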

Upvotes: 2

Quant Christo

Reputation: 1430

The solution by @jezrael is perfect, but here is a slightly different approach:

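# label each run of consecutive equal values
df['aux'] = (df['val'] != df['val'].shift()).cumsum()
# for rows with val == 1, take the end of the last row in the run
df.loc[df['val'] == 1, 'end'] = df[df['val'] == 1].groupby('aux')['end'].transform('last')
# replace val with the run's sum, i.e. the number of 1s in the run
df.loc[df['val'] == 1, 'val'] = df.groupby('aux')['val'].transform('sum')
# rows of a collapsed run now differ only in start, so keep the first of each
df = df.drop_duplicates(subset=df.columns.difference(['start']), keep='first')
df = df.drop(columns=['aux'])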
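
To make the role of the aux column concrete, here is a small sketch of how the shift/cumsum idiom labels runs of equal values. It uses only the val column from the question; the names starts_new_run and run_id are illustrative.

```python
import pandas as pd

val = pd.Series([1, 1, 1, 1, 1, 12, 12, 1, 1], name="val")

# True wherever the value differs from the previous row (the first row always counts).
starts_new_run = val != val.shift()

# The running total of those flags gives each run of equal values its own label.
run_id = starts_new_run.cumsum()
print(run_id.tolist())  # [1, 1, 1, 1, 1, 2, 2, 3, 3]

# Summing val within each run of 1s gives the length of that run.
ones = val == 1
run_lengths = val[ones].groupby(run_id[ones]).sum()
print(run_lengths.tolist())  # [5, 2]
```

Both 12 rows share run label 2, but they survive drop_duplicates because their end values differ (18 and 30).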

Upvotes: 0

jezrael

Reputation: 863226

Use:

# mask by condition
m = df.val==1
# consecutive groups
g = m.ne(m.shift()).cumsum()
# filter by condition and aggregate per group
df1 = df.groupby(g[m]).agg({'start':'first', 'end':'last', 'val':'sum'})
# concat together; index the untouched rows by g so sorting restores the original order
df = pd.concat([df1, df.set_index(g)[~m.values]]).sort_index().reset_index(drop=True)
print(df)
   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2
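
If you need this for values other than 1, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the frame has exactly the columns start, end and val, the helper name collapse_runs and its target parameter are hypothetical, and the run length is taken with 'count' instead of 'sum' so it still works when target is not 1.

```python
import pandas as pd


def collapse_runs(df, target=1):
    """Collapse consecutive rows where val == target (hypothetical helper).

    Assumes df has exactly the columns start, end and val.
    """
    m = df["val"].eq(target)
    # Label consecutive runs of matching / non-matching rows.
    g = m.ne(m.shift()).cumsum().rename("run")
    # One row per run of matching rows: first start, last end, run length.
    collapsed = df[m].groupby(g[m]).agg(
        start=("start", "first"),
        end=("end", "last"),
        val=("val", "count"),
    )
    # Non-matching rows are kept as they are, indexed by their run label.
    kept = df[~m].set_index(g[~m])
    # A stable sort on the run label restores the original row order.
    return (
        pd.concat([collapsed, kept])
          .sort_index(kind="mergesort")
          .reset_index(drop=True)
    )


df = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 18, 30, 31],
    "end":   [2, 3, 4, 5, 6, 18, 30, 31, 32],
    "val":   [1, 1, 1, 1, 1, 12, 12, 1, 1],
})
result = collapse_runs(df, target=1)
print(result)
```

The explicit mergesort is there because several untouched rows can share a run label, and a stable sort keeps them in their original relative order.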

Upvotes: 3
