Reputation: 2765
I have a file containing duplicate timestamps, maximum two for each timestamp, actually they are not duplicate, it is just the second timestamp needs to add a millisecond timestamp. For example, I am having these in the file,
....
2011/1/4 9:14:00
2011/1/4 9:15:00
2011/1/4 9:15:01
2011/1/4 9:15:01
2011/1/4 9:15:02
2011/1/4 9:15:02
2011/1/4 9:15:03
2011/1/4 9:15:03
2011/1/4 9:15:04
....
I would like to change them into
2011/1/4 9:14:00
2011/1/4 9:15:00
2011/1/4 9:15:01
2011/1/4 9:15:01.500
2011/1/4 9:15:02
2011/1/4 9:15:02.500
2011/1/4 9:15:03
2011/1/4 9:15:03.500
2011/1/4 9:15:04
....
what is the most efficient way to perform such task?
Upvotes: 1
Views: 1778
Reputation: 128958
Setup
In [69]: df = DataFrame(dict(time = x))
In [70]: df
Out[70]:
time
0 2013-01-01 09:01:00
1 2013-01-01 09:01:00
2 2013-01-01 09:01:01
3 2013-01-01 09:01:01
4 2013-01-01 09:01:02
5 2013-01-01 09:01:02
6 2013-01-01 09:01:03
7 2013-01-01 09:01:03
8 2013-01-01 09:01:04
9 2013-01-01 09:01:04
Find the locations where the difference in time from the previous row is 0 seconds
In [71]: mask = (df.time-df.time.shift()) == np.timedelta64(0,'s')
In [72]: mask
Out[72]:
0 False
1 True
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 True
Name: time, dtype: bool
Set theose locations to use an offset of 5 milliseconds (In your question you used 500 but could be anything). This requires numpy >= 1.7. (Not that this syntax will be changing in 0.13 to allow a more direct df.loc[mask,'time'] += pd.offsets.Milli(5)
In [73]: df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5))
In [74]: df
Out[74]:
time
0 2013-01-01 09:01:00
1 2013-01-01 09:01:00.005000
2 2013-01-01 09:01:01
3 2013-01-01 09:01:01.005000
4 2013-01-01 09:01:02
5 2013-01-01 09:01:02.005000
6 2013-01-01 09:01:03
7 2013-01-01 09:01:03.005000
8 2013-01-01 09:01:04
9 2013-01-01 09:01:04.005000
Upvotes: 2
Reputation: 28946
So this algorithm should work very well... I'm just having a hell of a time with numpy's datetime datatypes.
In [154]: df
Out[154]:
0
0 2011/1/4 9:14:00
1 2011/1/4 9:15:00
2 2011/1/4 9:15:01
3 2011/1/4 9:15:01
4 2011/1/4 9:15:02
5 2011/1/4 9:15:02
6 2011/1/4 9:15:03
7 2011/1/4 9:15:03
8 2011/1/4 9:15:04
In [155]: ((dt.diff() == 0) * .005)
Out[155]:
0 0.000
1 0.000
2 0.000
3 0.005
4 0.000
5 0.005
6 0.000
7 0.005
8 0.000
Name: 0, dtype: float64
And the idea is to add those two together. Of course, one is datetime64
and the other is float64
. For whatever reasons, np.timedelta64
doesn't operate on arrays? Anyway if you can sort out the dtype issues that will work.
Upvotes: 1
Reputation: 28380
Assuming - as you have shown in your example that they are sequential:
lasttimestamp = None
for ts = readtimestamp(infile): # I will leave this to you
if ts == lasttimestamp:
ts += inc_by # and this
lasttimestamp = ts
writetimestamp(outfile, ts) # and this to
Upvotes: 0