Reputation: 5105
I have dataframe df
:
Id timestamp data sig events1 Start Peak Timediff Datadiff
104513 104754 2012-03-21 16:23:21.323 19.5 1.0 0.0 1.0 0.0 28732.920 0.5
104514 104755 2012-03-21 16:23:21.423 20.0 -1.0 0.0 0.0 1.0 0.100 0.5
104623 104864 2012-03-22 04:27:04.550 19.5 0.0 0.0 0.0 0.0 43423.127 -0.5
104630 104871 2012-03-22 04:27:11.670 19.5 -1.0 0.0 0.0 1.0 7.120 0.0
105147 105388 2012-03-23 06:12:24.523 19.0 -1.0 0.0 0.0 0.0 92712.853 -0.5
105148 105389 2012-03-23 06:12:24.623 18.5 1.0 1.0 0.0 0.0 0.100 -0.5
For the rows where Timediff == 0.100
, and timestamp
have the same hour, minute and second (eg. 16:23:21.323
and 16:23:21.423
), remove the first of the pair, ie. 16:23:21.323
(the previous row).How can I do this?
Expected outcome:
Id timestamp data sig events1 Start Peak Timediff Datadiff
104514 104755 2012-03-21 16:23:21.423 20.0 -1.0 0.0 0.0 1.0 0.100 0.5
104623 104864 2012-03-22 04:27:04.550 19.5 0.0 0.0 0.0 0.0 43423.127 -0.5
104630 104871 2012-03-22 04:27:11.670 19.5 -1.0 0.0 0.0 1.0 7.120 0.0
105148 105389 2012-03-23 06:12:24.623 18.5 1.0 1.0 0.0 0.0 0.100 -0.5
Upvotes: 0
Views: 82
Reputation: 862681
Idea is grouping by floor datetimes per seconds by Series.dt.floor
and then remove previous row if exist at least one matched mask m1
:
#convert to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])
#test 0.1
m1 = df['Timediff'].eq(0.1)
#test previous values of m1
m2 = m1.shift(-1, fill_value=False)
#test if in groups floored by seconds is at least one True in m1
m3 = m1.groupby([df['timestamp'].dt.floor('S')]).transform('any')
#filter chained m3 with m2 and invert mask
df = df[~(m3 & m2)]
print (df)
Id timestamp data sig events1 Start Peak \
104514 104755 2012-03-21 16:23:21.423 20.0 -1.0 0.0 0.0 1.0
104623 104864 2012-03-22 04:27:04.550 19.5 0.0 0.0 0.0 0.0
104630 104871 2012-03-22 04:27:11.670 19.5 -1.0 0.0 0.0 1.0
105148 105389 2012-03-23 06:12:24.623 18.5 1.0 1.0 0.0 0.0
Timediff Datadiff
104514 0.100 0.5
104623 43423.127 -0.5
104630 7.120 0.0
105148 0.100 -0.5
EDIT: Answer form comment with next constions:
m1 = df3['Timediff'].eq(0.100) & df3['Start'].ne(1) & df3['Start'].shift(1).ne(1)
Upvotes: 1
Reputation:
I have added some values to your dataframe to receive same hours ,minutes, and sec for records with value =0.100
df = pd.DataFrame({'data':['16:23:21.323 19.5','16:23:21.423 20.0','04:27:04.550 19.5',
'04:27:11.670 19.5','06:12:24.523 19.0','06:12:24.623 18.5',
'16:23:21.323 19.5','06:12:24.523 19.0'],
'timestamp':['2012-03-21','2012-03-21','2012-03-22','2012-03-22','2012-03-23','2012-03-23',
'2012-03-23','2012-03-23'],
'timediff':[28732.920,0.100,43423.127,7.120,0.100,0.100,92712.853,0.100]})
df['h_m_s'] = df['data'].str.extract(r'(.*) ')
data timediff timestamp h_m_s
0 16:23:21.323 19.5 28732.920 2012-03-21 16:23:21.323
1 16:23:21.423 20.0 0.100 2012-03-21 16:23:21.423
2 04:27:04.550 19.5 43423.127 2012-03-22 04:27:04.550
3 04:27:11.670 19.5 7.120 2012-03-22 04:27:11.670
4 06:12:24.523 19.0 0.100 2012-03-23 06:12:24.523
5 06:12:24.623 18.5 0.100 2012-03-23 06:12:24.623
6 16:23:21.323 19.5 92712.853 2012-03-23 16:23:21.323
7 06:12:24.523 19.0 0.100 2012-03-23 06:12:24.523
df_same = df[(df.timediff==0.100)]
df = df[~(df.timediff==0.100)]
df_same = df_same.drop_duplicates(subset=['h_m_s'],keep='first')
df=pd.concat([df,df_same])
print(df)
data timediff timestamp h_m_s
0 16:23:21.323 19.5 28732.920 2012-03-21 16:23:21.323
2 04:27:04.550 19.5 43423.127 2012-03-22 04:27:04.550
3 04:27:11.670 19.5 7.120 2012-03-22 04:27:11.670
6 16:23:21.323 19.5 92712.853 2012-03-23 16:23:21.323
1 16:23:21.423 20.0 0.100 2012-03-21 16:23:21.423
4 06:12:24.523 19.0 0.100 2012-03-23 06:12:24.523
5 06:12:24.623 18.5 0.100 2012-03-23 06:12:24.623
Upvotes: 1
Reputation: 7509
There may be a simpler way to achieve this, but I believe this works given your conditions:
# reset index to make our life easier
df = df.reset_index()
# get rows with small timediffs
small_timediffs = df.loc[df.Timediff <= 0.1]
# get rows immediately above rows with small timediffs
prev_small_timediffs = df.loc[small_timediffs.index-1]
# helper function to allow comparison of timestamps without milisseconds
remove_milisseconds = lambda x: x.split('.')[0]
# compare if timestamps from rows with small timediffs
# and rows above them are the same
stamps = small_timediffs.timestamp.apply(remove_milisseconds).values
prev_stamps = prev_small_timediffs.timestamp.apply(remove_milisseconds).values
cond = (stamps == prev_stamps)
# extract results based on both conditions
# (small timediff and same timestamp)
result = df.loc[~df.index.isin(small_timediffs.loc[cond].index-1)]
Upvotes: 1