strongvigilance

Reputation: 103

Modifying timestamps in pandas to make index unique

I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. This is an example of the data - note there are four 2016-08-23 00:00:17.664193 timestamps:

In [167]: ts
Out[167]: 
                               last  last_sz      bid      ask
datetime                                                      
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00

In this example, there are only a few duplicates, but in some cases there are hundreds of consecutive rows, all sharing the same timestamp. I'm aiming to solve this by adding 1 extra nanosecond to each duplicate (so in the case of 4 consecutive identical timestamps, I'd add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be converted to:

In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex                                                        
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
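(Note: pandas stores timestamps as datetime64[ns], so 1ns is the smallest representable step and this scheme loses no precision. A quick check on one of the values above:

import pandas as pd

t = pd.Timestamp('2016-08-23 00:00:17.664193')
print(t + pd.Timedelta('1ns'))   # 2016-08-23 00:00:17.664193001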

I've struggled to find a good way to do this - my current solution is to make multiple passes, checking for duplicates each time, and adding 1ns to all but the first in a series of identical timestamps. Here's the code:

def make_timestamps_unique(ts):
    # Mark every row whose timestamp also appears earlier in the index.
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0

    while duplicate_count > 0:
        # Copy the index into a column, bump the duplicated rows by 1ns,
        # promote the column back to the index, then re-check for duplicates.
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1

    print('%d passes of duplication loop' % passes)
    return ts

This is obviously quite inefficient - it often requires hundreds of passes, and if I try it on a 2 million row dataframe, I get a MemoryError. Any ideas for a better way to achieve this?

Upvotes: 8

Views: 2591

Answers (3)

pansen

Reputation: 6663

Here is a faster NumPy version (though a little less readable), inspired by this SO article. The idea is to run cumsum over the duplicated-timestamp flags while resetting the cumulative sum each time a NaN is encountered:

import numpy as np

# get duplicated flags as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan

# at each NaN position, subtract the running total so the cumsum restarts from zero
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# print result
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)

DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)
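To see how the reset works, here is a minimal standalone trace of the same trick on a toy array (the array is purely illustrative; NaN marks a unique timestamp, 1.0 marks a member of a duplicate run):

import numpy as np

values = np.array([np.nan, 1.0, 1.0, np.nan, 1.0])

missings = np.isnan(values)
cumsum = np.cumsum(~missings)                       # running count of duplicates
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff                            # negative jumps cancel the count so far
print(np.cumsum(values).astype(int))                # [0 1 2 0 1]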

Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the groupby/apply approach from @DYZ (which is more readable) is roughly 50 times slower on the same dummy data: 100 loops, best of 3: 5.3 ms per loop.

Finally, of course, you have to assign the new index back:

df.index = result
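If you want it packaged as a single helper, a sketch wrapping the steps above into one function (my own packaging; note that, as in the output above, every member of a duplicate run gets an offset, including the first):

import numpy as np

def deduplicate_index(df):
    # offset each member of a run of duplicate timestamps by a growing number of ns
    values = df.index.duplicated(keep=False).astype(float)
    values[values == 0] = np.nan

    missings = np.isnan(values)
    cumsum = np.cumsum(~missings)
    diff = np.diff(np.concatenate(([0.], cumsum[missings])))
    values[missings] = -diff

    out = df.copy()
    out.index = df.index + np.cumsum(values).astype(np.timedelta64)
    return out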

Upvotes: 8

SerialDev

Reputation: 2847

Let's aim for a vectorized solution; since you are dealing with 1M+ rows, performance should be the priority.

Let's make some test data, since none was provided in a reproducible form:

rng = pd.date_range('1/1/2011', periods=72, freq='H')

df = pd.DataFrame(dict(time = rng))

Duplicate the timestamps:

df = pd.concat((df, df))
df = df.sort_values('time')

df
Out [296]:
                  time
0  2011-01-01 00:00:00
0  2011-01-01 00:00:00
1  2011-01-01 01:00:00
1  2011-01-01 01:00:00
2  2011-01-01 02:00:00
2  2011-01-01 02:00:00
3  2011-01-01 03:00:00
3  2011-01-01 03:00:00
4  2011-01-01 04:00:00
4  2011-01-01 04:00:00
5  2011-01-01 05:00:00
5  2011-01-01 05:00:00
6  2011-01-01 06:00:00
6  2011-01-01 06:00:00
7  2011-01-01 07:00:00
7  2011-01-01 07:00:00
8  2011-01-01 08:00:00
8  2011-01-01 08:00:00
9  2011-01-01 09:00:00
9  2011-01-01 09:00:00

Find the locations where the difference in time from the previous row is 0 seconds:

mask = (df.time-df.time.shift()) == np.timedelta64(0,'s')

mask
Out [307]:
0     False
0      True
1     False
1      True
2     False
2      True
3     False
3      True
4     False
4      True
5     False

Offset these locations; in this case, I chose milliseconds:

df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5))

Out [309]:
                      time
0  2011-01-01 00:00:00.000
0  2011-01-01 00:00:00.005
1  2011-01-01 01:00:00.000
1  2011-01-01 01:00:00.005
2  2011-01-01 02:00:00.000
2  2011-01-01 02:00:00.005
3  2011-01-01 03:00:00.000
3  2011-01-01 03:00:00.005
4  2011-01-01 04:00:00.000
4  2011-01-01 04:00:00.005
5  2011-01-01 05:00:00.000
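The apply isn't strictly necessary here, by the way; Milli is a fixed-width offset, so the same shift can also be written as one vectorized assignment:

df.loc[mask, 'time'] = df.loc[mask, 'time'] + pd.offsets.Milli(5)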

EDIT: With runs of consecutive duplicate timestamps (this assumes runs of at most 4):

consect = 4
for i in range(4):
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1
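For runs of arbitrary length, a single-pass alternative (a sketch, not part of the answer above) is to count each row's position within its group of identical timestamps with groupby/cumcount and offset by that many milliseconds:

dup_pos = df.groupby('time').cumcount()   # 0 for the first row of a run, 1 for the next, ...
df['time'] = df['time'] + pd.to_timedelta(dup_pos * 5, unit='ms')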

Upvotes: 1

DYZ

Reputation: 57033

You can group the rows by the index and then add a range of sequential timedeltas to the index of each group. I am not sure if this can be done directly with the index, but you can first convert the index to an ordinary column, apply the operation to the column, and set the column as the index again:

newindex = ts.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
ts.index = newindex
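A usage note (my addition, not part of the answer): the same offsets can also be computed without resetting the index, using GroupBy.cumcount on the index level and adding the result back directly:

offsets = pd.to_timedelta(ts.groupby(level=0).cumcount().values, unit='ns')
ts.index = ts.index + offsets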

Upvotes: 5
