Reputation: 4021
I have a large dataframe indexed by timestamps in which I would like to assign the rows to groups according to a time range.
In the following data for example, I have grouped the rows within 1ms of the first entry in the group.
groupid
1999-12-31 23:59:59.000107 1
1999-12-31 23:59:59.000385 1
1999-12-31 23:59:59.000404 1
1999-12-31 23:59:59.000704 1
1999-12-31 23:59:59.001281 2
1999-12-31 23:59:59.002211 2
1999-12-31 23:59:59.002367 3
I have working code which does this by iterating over the rows and using the current row to slice the dataframe:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x)
              for x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, dts)
print df.head(20)

groupid = 1
for dt, row in df.iterrows():
    # skip rows already assigned to a group
    if df.loc[row.name].groupid:
        continue
    # assign this row and everything within 1ms of it to the current group
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1
print df.head(20)
However, as is common with iterrows, the operation is slow on large dataframes. I've made various attempts at applying a function and using groupby, but without success. Is using itertuples the best I can do for a performance boost (which I'm going to try now)? Could someone give some advice please?
Upvotes: 2
Views: 2222
Reputation: 393933
OK, I think the following is what you want. This constructs a Series of timedeltas from your index by subtracting the first value from every value. We then access the microseconds component, divide by 1000, and cast the Series dtype to int. Note that this relies on all the timestamps falling within the same second, since microseconds is only the microsecond component of each timedelta rather than the total elapsed time:
In [86]:
df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
groupid
1999-12-31 23:59:59.000133 0
1999-12-31 23:59:59.000584 0
1999-12-31 23:59:59.003544 3
1999-12-31 23:59:59.009193 9
1999-12-31 23:59:59.010220 10
1999-12-31 23:59:59.010632 10
1999-12-31 23:59:59.010716 10
1999-12-31 23:59:59.011387 11
1999-12-31 23:59:59.011837 11
1999-12-31 23:59:59.013277 13
1999-12-31 23:59:59.013305 13
1999-12-31 23:59:59.014754 14
1999-12-31 23:59:59.016015 15
1999-12-31 23:59:59.016067 15
1999-12-31 23:59:59.017788 17
1999-12-31 23:59:59.018236 18
1999-12-31 23:59:59.021281 21
1999-12-31 23:59:59.021772 21
1999-12-31 23:59:59.021927 21
1999-12-31 23:59:59.022200 22
1999-12-31 23:59:59.023104 22
1999-12-31 23:59:59.023375 23
1999-12-31 23:59:59.023688 23
1999-12-31 23:59:59.023726 23
1999-12-31 23:59:59.025397 25
1999-12-31 23:59:59.026407 26
1999-12-31 23:59:59.026480 26
1999-12-31 23:59:59.027825 27
1999-12-31 23:59:59.028793 28
1999-12-31 23:59:59.030716 30
... ...
1999-12-31 23:59:59.975432 975
1999-12-31 23:59:59.976699 976
1999-12-31 23:59:59.977177 977
1999-12-31 23:59:59.979475 979
1999-12-31 23:59:59.980282 980
1999-12-31 23:59:59.980672 980
1999-12-31 23:59:59.983202 983
1999-12-31 23:59:59.984214 984
1999-12-31 23:59:59.984674 984
1999-12-31 23:59:59.984933 984
1999-12-31 23:59:59.985664 985
1999-12-31 23:59:59.985779 985
1999-12-31 23:59:59.988812 988
1999-12-31 23:59:59.989324 989
1999-12-31 23:59:59.990386 990
1999-12-31 23:59:59.990485 990
1999-12-31 23:59:59.990969 990
1999-12-31 23:59:59.991255 991
1999-12-31 23:59:59.991739 991
1999-12-31 23:59:59.993979 993
1999-12-31 23:59:59.994705 994
1999-12-31 23:59:59.994874 994
1999-12-31 23:59:59.995397 995
1999-12-31 23:59:59.995753 995
1999-12-31 23:59:59.995863 995
1999-12-31 23:59:59.996574 996
1999-12-31 23:59:59.998139 998
1999-12-31 23:59:59.998533 998
1999-12-31 23:59:59.998778 998
1999-12-31 23:59:59.999915 999
Thanks to @Jeff for pointing out the much cleaner method:
In [96]:
df['groupid'] = (df.index - df.index[0]).astype('timedelta64[ms]')
df
Out[96]:
groupid
1999-12-31 23:59:59.000884 0
1999-12-31 23:59:59.001175 0
1999-12-31 23:59:59.001262 0
1999-12-31 23:59:59.001540 0
1999-12-31 23:59:59.001769 0
1999-12-31 23:59:59.002478 1
1999-12-31 23:59:59.005001 4
1999-12-31 23:59:59.005497 4
1999-12-31 23:59:59.006908 6
1999-12-31 23:59:59.008860 7
1999-12-31 23:59:59.009257 8
1999-12-31 23:59:59.010012 9
1999-12-31 23:59:59.011451 10
1999-12-31 23:59:59.013177 12
1999-12-31 23:59:59.014138 13
1999-12-31 23:59:59.015795 14
1999-12-31 23:59:59.015865 14
1999-12-31 23:59:59.016069 15
1999-12-31 23:59:59.016666 15
1999-12-31 23:59:59.016718 15
1999-12-31 23:59:59.019058 18
1999-12-31 23:59:59.019675 18
1999-12-31 23:59:59.020747 19
1999-12-31 23:59:59.021856 20
1999-12-31 23:59:59.022959 22
1999-12-31 23:59:59.023812 22
1999-12-31 23:59:59.023938 23
1999-12-31 23:59:59.024122 23
1999-12-31 23:59:59.025332 24
1999-12-31 23:59:59.025397 24
... ...
1999-12-31 23:59:59.959725 958
1999-12-31 23:59:59.959742 958
1999-12-31 23:59:59.959892 959
1999-12-31 23:59:59.960345 959
1999-12-31 23:59:59.960800 959
1999-12-31 23:59:59.961054 960
1999-12-31 23:59:59.962749 961
1999-12-31 23:59:59.965681 964
1999-12-31 23:59:59.966409 965
1999-12-31 23:59:59.966558 965
1999-12-31 23:59:59.967357 966
1999-12-31 23:59:59.967842 966
1999-12-31 23:59:59.970465 969
1999-12-31 23:59:59.974022 973
1999-12-31 23:59:59.974734 973
1999-12-31 23:59:59.975879 974
1999-12-31 23:59:59.978291 977
1999-12-31 23:59:59.980483 979
1999-12-31 23:59:59.980868 979
1999-12-31 23:59:59.981417 980
1999-12-31 23:59:59.984208 983
1999-12-31 23:59:59.984639 983
1999-12-31 23:59:59.985533 984
1999-12-31 23:59:59.986785 985
1999-12-31 23:59:59.987502 986
1999-12-31 23:59:59.987914 987
1999-12-31 23:59:59.988406 987
1999-12-31 23:59:59.989436 988
1999-12-31 23:59:59.994449 993
1999-12-31 23:59:59.996657 995
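One caveat: in more recent pandas versions, casting a TimedeltaIndex with .astype('timedelta64[ms]') no longer returns plain integer millisecond counts, so the line above may not behave as shown. Floor division by a Timedelta should give the same labels across versions; a minimal sketch, assuming a reasonably recent pandas:

import pandas as pd

# Elapsed time since the first timestamp, floor-divided into whole
# milliseconds; yields the same 0, 1, 2, ... bin labels as above
df['groupid'] = (df.index - df.index[0]) // pd.Timedelta(milliseconds=1)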
Upvotes: 4
Reputation: 128928
This is like a resample operation.
Create your data
In [39]: pd.set_option('max_rows',12)
In [40]: np.random.seed(11111)
In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
   ....:               x in np.random.randint(1, 999999, 1000)])
In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)
So simply grouping gives you the groups directly. You can iterate over the result, as it is a generator.
In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]:
[(Timestamp('1999-12-31 23:59:59', offset='L'),
groupid
1999-12-31 23:59:59.000789 -1.369503
1999-12-31 23:59:59.000814 0.776049),
(Timestamp('1999-12-31 23:59:59.001000', offset='L'),
groupid
1999-12-31 23:59:59.001041 -0.374915
1999-12-31 23:59:59.001062 -1.470845),
(Timestamp('1999-12-31 23:59:59.002000', offset='L'),
groupid
1999-12-31 23:59:59.002355 -0.240954)]
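If the goal is the group label itself rather than the group contents, the same Grouper can feed GroupBy.ngroup (available since pandas 0.20, so an assumption if you are on an older version), which numbers each row's bin consecutively; binid below is just an illustrative column name:

# Number each occupied 1ms bin 0, 1, 2, ... and write it back per row;
# note that empty bins get no number, unlike the elapsed-millisecond approach
df['binid'] = df.groupby(pd.Grouper(freq='ms')).ngroup()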
Might be simpler just to resample. You can pass a custom function as how.
In [44]: df.resample('ms', how='sum')
Out[44]:
groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003 1.291403
1999-12-31 23:59:59.004 NaN
1999-12-31 23:59:59.005 0.291484
... ...
1999-12-31 23:59:59.994 NaN
1999-12-31 23:59:59.995 NaN
1999-12-31 23:59:59.996 NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998 NaN
1999-12-31 23:59:59.999 -0.895305
[1000 rows x 1 columns]
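As an aside, the how argument was later deprecated and removed from resample; in current pandas the same call is spelled with method chaining, so treat the following as a version-dependent sketch:

# Modern equivalent of df.resample('ms', how='sum')
df.resample('ms').sum()

# An arbitrary custom function goes through .apply (or .agg)
df.resample('ms').apply(lambda g: g.sum())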
Upvotes: 2