nilsinelabore
nilsinelabore

Reputation: 5105

How to filter data points with very close timestamps

I have dataframe df:

          Id    timestamp               data    sig     events1 Start   Peak    Timediff    Datadiff
104513  104754  2012-03-21 16:23:21.323 19.5    1.0     0.0     1.0     0.0     28732.920   0.5
104514  104755  2012-03-21 16:23:21.423 20.0    -1.0    0.0     0.0     1.0     0.100       0.5
104623  104864  2012-03-22 04:27:04.550 19.5    0.0     0.0     0.0     0.0     43423.127   -0.5
104630  104871  2012-03-22 04:27:11.670 19.5    -1.0    0.0     0.0     1.0     7.120       0.0
105147  105388  2012-03-23 06:12:24.523 19.0    -1.0    0.0     0.0     0.0     92712.853   -0.5
105148  105389  2012-03-23 06:12:24.623 18.5    1.0     1.0     0.0     0.0     0.100       -0.5

For the rows where Timediff == 0.100, and timestamp have the same hour, minute and second (eg. 16:23:21.323 and 16:23:21.423), remove the first of the pair, ie. 16:23:21.323(the previous row).How can I do this?

Expected outcome:

          Id    timestamp               data    sig     events1 Start   Peak    Timediff    Datadiff

104514  104755  2012-03-21 16:23:21.423 20.0    -1.0    0.0     0.0     1.0     0.100       0.5
104623  104864  2012-03-22 04:27:04.550 19.5    0.0     0.0     0.0     0.0     43423.127   -0.5
104630  104871  2012-03-22 04:27:11.670 19.5    -1.0    0.0     0.0     1.0     7.120       0.0
105148  105389  2012-03-23 06:12:24.623 18.5    1.0     1.0     0.0     0.0     0.100       -0.5

Upvotes: 0

Views: 82

Answers (3)

jezrael
jezrael

Reputation: 862681

Idea is grouping by floor datetimes per seconds by Series.dt.floor and then remove previous row if exist at least one matched mask m1:

#convert to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])

#test 0.1
m1 = df['Timediff'].eq(0.1)
#test previous values of m1
m2 = m1.shift(-1, fill_value=False)
#test if in groups floored by seconds is at least one True in m1
m3 = m1.groupby([df['timestamp'].dt.floor('S')]).transform('any')
#filter chained m3 with m2 and invert mask
df = df[~(m3 & m2)]
print (df)
            Id               timestamp  data  sig  events1  Start  Peak  \
104514  104755 2012-03-21 16:23:21.423  20.0 -1.0      0.0    0.0   1.0   
104623  104864 2012-03-22 04:27:04.550  19.5  0.0      0.0    0.0   0.0   
104630  104871 2012-03-22 04:27:11.670  19.5 -1.0      0.0    0.0   1.0   
105148  105389 2012-03-23 06:12:24.623  18.5  1.0      1.0    0.0   0.0   

         Timediff  Datadiff  
104514      0.100       0.5  
104623  43423.127      -0.5  
104630      7.120       0.0  
105148      0.100      -0.5  

EDIT: Answer form comment with next constions:

m1 = df3['Timediff'].eq(0.100) & df3['Start'].ne(1) & df3['Start'].shift(1).ne(1)

Upvotes: 1

user8560167
user8560167

Reputation:

I have added some values to your dataframe to receive same hours ,minutes, and sec for records with value =0.100

df = pd.DataFrame({'data':['16:23:21.323 19.5','16:23:21.423 20.0','04:27:04.550 19.5',
                               '04:27:11.670 19.5','06:12:24.523 19.0','06:12:24.623 18.5',
                               '16:23:21.323 19.5','06:12:24.523 19.0'],
                       'timestamp':['2012-03-21','2012-03-21','2012-03-22','2012-03-22','2012-03-23','2012-03-23',
                                    '2012-03-23','2012-03-23'],
                       'timediff':[28732.920,0.100,43423.127,7.120,0.100,0.100,92712.853,0.100]})

df['h_m_s'] = df['data'].str.extract(r'(.*) ')

                data   timediff   timestamp         h_m_s
0  16:23:21.323 19.5  28732.920  2012-03-21  16:23:21.323
1  16:23:21.423 20.0      0.100  2012-03-21  16:23:21.423
2  04:27:04.550 19.5  43423.127  2012-03-22  04:27:04.550
3  04:27:11.670 19.5      7.120  2012-03-22  04:27:11.670
4  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523
5  06:12:24.623 18.5      0.100  2012-03-23  06:12:24.623
6  16:23:21.323 19.5  92712.853  2012-03-23  16:23:21.323
7  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523

df_same = df[(df.timediff==0.100)]
df = df[~(df.timediff==0.100)]
df_same = df_same.drop_duplicates(subset=['h_m_s'],keep='first')
df=pd.concat([df,df_same])
print(df)
                data   timediff   timestamp         h_m_s
0  16:23:21.323 19.5  28732.920  2012-03-21  16:23:21.323
2  04:27:04.550 19.5  43423.127  2012-03-22  04:27:04.550
3  04:27:11.670 19.5      7.120  2012-03-22  04:27:11.670
6  16:23:21.323 19.5  92712.853  2012-03-23  16:23:21.323
1  16:23:21.423 20.0      0.100  2012-03-21  16:23:21.423
4  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523
5  06:12:24.623 18.5      0.100  2012-03-23  06:12:24.623

Upvotes: 1

jfaccioni
jfaccioni

Reputation: 7509

There may be a simpler way to achieve this, but I believe this works given your conditions:

# reset index to make our life easier
df = df.reset_index()

# get rows with small timediffs
small_timediffs = df.loc[df.Timediff <= 0.1]
# get rows immediately above rows with small timediffs
prev_small_timediffs = df.loc[small_timediffs.index-1]

# helper function to allow comparison of timestamps without milisseconds
remove_milisseconds = lambda x: x.split('.')[0]

# compare if timestamps from rows with small timediffs 
# and rows above them are the same 
stamps = small_timediffs.timestamp.apply(remove_milisseconds).values
prev_stamps = prev_small_timediffs.timestamp.apply(remove_milisseconds).values
cond = (stamps == prev_stamps)

# extract results based on both conditions
# (small timediff and same timestamp)
result = df.loc[~df.index.isin(small_timediffs.loc[cond].index-1)]

Upvotes: 1

Related Questions