MarkNS
MarkNS

Reputation: 4021

Grouping rows by time-range in Pandas dataframe

I have a large dataframe indexed by timestamps in which I would like to assign the rows to groups according to a time range.

In the following data for example, I have grouped the rows within 1ms of the first entry in the group.

                           groupid
1999-12-31 23:59:59.000107       1
1999-12-31 23:59:59.000385       1
1999-12-31 23:59:59.000404       1
1999-12-31 23:59:59.000704       1
1999-12-31 23:59:59.001281       2
1999-12-31 23:59:59.002211       2
1999-12-31 23:59:59.002367       3

I have working code which does this by iterating the rows, and using the current row to slice the dataframe:

dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])
df = pd.DataFrame({'groupid': None}, dts)

print df.head(20)

groupid = 1
for dt, row in df.iterrows():
    if df.loc[row.name].groupid:
        continue
    end = dt + timedelta(milliseconds=1)
    group = df.loc[dt:end]
    df.loc[group.index, 'groupid'] = groupid
    groupid += 1

print df.head(20)

However, as is common with iterrows, the operation is slow on large dataframes. I've made various attempts at applying a function and using groupby, but without success. Is using itertuples the best I can do for a performance boost(which I'm going to try now)? Could someone give some advice please?

Upvotes: 2

Views: 2222

Answers (2)

EdChum
EdChum

Reputation: 393933

OK, I think the following is what you want, this constructs a TimeDelta from your index by subtracting all values by the first value. We then access the microseconds component and divide by 1000 and then cast the Series dtype to int:

In [86]:

df['groupid'] = ((df.index.to_series() - df.index[0]).dt.microseconds / 1000).astype(np.int32)
df
Out[86]:
                            groupid
1999-12-31 23:59:59.000133        0
1999-12-31 23:59:59.000584        0
1999-12-31 23:59:59.003544        3
1999-12-31 23:59:59.009193        9
1999-12-31 23:59:59.010220       10
1999-12-31 23:59:59.010632       10
1999-12-31 23:59:59.010716       10
1999-12-31 23:59:59.011387       11
1999-12-31 23:59:59.011837       11
1999-12-31 23:59:59.013277       13
1999-12-31 23:59:59.013305       13
1999-12-31 23:59:59.014754       14
1999-12-31 23:59:59.016015       15
1999-12-31 23:59:59.016067       15
1999-12-31 23:59:59.017788       17
1999-12-31 23:59:59.018236       18
1999-12-31 23:59:59.021281       21
1999-12-31 23:59:59.021772       21
1999-12-31 23:59:59.021927       21
1999-12-31 23:59:59.022200       22
1999-12-31 23:59:59.023104       22
1999-12-31 23:59:59.023375       23
1999-12-31 23:59:59.023688       23
1999-12-31 23:59:59.023726       23
1999-12-31 23:59:59.025397       25
1999-12-31 23:59:59.026407       26
1999-12-31 23:59:59.026480       26
1999-12-31 23:59:59.027825       27
1999-12-31 23:59:59.028793       28
1999-12-31 23:59:59.030716       30
...                             ...
1999-12-31 23:59:59.975432      975
1999-12-31 23:59:59.976699      976
1999-12-31 23:59:59.977177      977
1999-12-31 23:59:59.979475      979
1999-12-31 23:59:59.980282      980
1999-12-31 23:59:59.980672      980
1999-12-31 23:59:59.983202      983
1999-12-31 23:59:59.984214      984
1999-12-31 23:59:59.984674      984
1999-12-31 23:59:59.984933      984
1999-12-31 23:59:59.985664      985
1999-12-31 23:59:59.985779      985
1999-12-31 23:59:59.988812      988
1999-12-31 23:59:59.989324      989
1999-12-31 23:59:59.990386      990
1999-12-31 23:59:59.990485      990
1999-12-31 23:59:59.990969      990
1999-12-31 23:59:59.991255      991
1999-12-31 23:59:59.991739      991
1999-12-31 23:59:59.993979      993
1999-12-31 23:59:59.994705      994
1999-12-31 23:59:59.994874      994
1999-12-31 23:59:59.995397      995
1999-12-31 23:59:59.995753      995
1999-12-31 23:59:59.995863      995
1999-12-31 23:59:59.996574      996
1999-12-31 23:59:59.998139      998
1999-12-31 23:59:59.998533      998
1999-12-31 23:59:59.998778      998
1999-12-31 23:59:59.999915      999

Thanks to @Jeff for pointing out the much cleaner method:

In [96]:
df['groupid'] = (df.index-df.index[0]).astype('timedelta64[ms]')
df

Out[96]:
                            groupid
1999-12-31 23:59:59.000884        0
1999-12-31 23:59:59.001175        0
1999-12-31 23:59:59.001262        0
1999-12-31 23:59:59.001540        0
1999-12-31 23:59:59.001769        0
1999-12-31 23:59:59.002478        1
1999-12-31 23:59:59.005001        4
1999-12-31 23:59:59.005497        4
1999-12-31 23:59:59.006908        6
1999-12-31 23:59:59.008860        7
1999-12-31 23:59:59.009257        8
1999-12-31 23:59:59.010012        9
1999-12-31 23:59:59.011451       10
1999-12-31 23:59:59.013177       12
1999-12-31 23:59:59.014138       13
1999-12-31 23:59:59.015795       14
1999-12-31 23:59:59.015865       14
1999-12-31 23:59:59.016069       15
1999-12-31 23:59:59.016666       15
1999-12-31 23:59:59.016718       15
1999-12-31 23:59:59.019058       18
1999-12-31 23:59:59.019675       18
1999-12-31 23:59:59.020747       19
1999-12-31 23:59:59.021856       20
1999-12-31 23:59:59.022959       22
1999-12-31 23:59:59.023812       22
1999-12-31 23:59:59.023938       23
1999-12-31 23:59:59.024122       23
1999-12-31 23:59:59.025332       24
1999-12-31 23:59:59.025397       24
...                             ...
1999-12-31 23:59:59.959725      958
1999-12-31 23:59:59.959742      958
1999-12-31 23:59:59.959892      959
1999-12-31 23:59:59.960345      959
1999-12-31 23:59:59.960800      959
1999-12-31 23:59:59.961054      960
1999-12-31 23:59:59.962749      961
1999-12-31 23:59:59.965681      964
1999-12-31 23:59:59.966409      965
1999-12-31 23:59:59.966558      965
1999-12-31 23:59:59.967357      966
1999-12-31 23:59:59.967842      966
1999-12-31 23:59:59.970465      969
1999-12-31 23:59:59.974022      973
1999-12-31 23:59:59.974734      973
1999-12-31 23:59:59.975879      974
1999-12-31 23:59:59.978291      977
1999-12-31 23:59:59.980483      979
1999-12-31 23:59:59.980868      979
1999-12-31 23:59:59.981417      980
1999-12-31 23:59:59.984208      983
1999-12-31 23:59:59.984639      983
1999-12-31 23:59:59.985533      984
1999-12-31 23:59:59.986785      985
1999-12-31 23:59:59.987502      986
1999-12-31 23:59:59.987914      987
1999-12-31 23:59:59.988406      987
1999-12-31 23:59:59.989436      988
1999-12-31 23:59:59.994449      993
1999-12-31 23:59:59.996657      995

Upvotes: 4

Jeff
Jeff

Reputation: 128928

This is like a resample operation.

Create your data

In [39]: pd.set_option('max_rows',12)

In [40]: np.random.seed(11111)

In [41]: dts = sorted([datetime(1999, 12, 31, 23, 59, 59, x) for
              x in np.random.randint(1, 999999, 1000)])

In [42]: df = pd.DataFrame({'groupid': np.random.randn(len(dts))}, dts)

So simply grouping gives you the groups directly. You can iterate as this is a generator.

In [43]: list(df.groupby(pd.Grouper(freq='ms')))[0:3]
Out[43]: 
[(Timestamp('1999-12-31 23:59:59', offset='L'),
                               groupid
  1999-12-31 23:59:59.000789 -1.369503
  1999-12-31 23:59:59.000814  0.776049),
 (Timestamp('1999-12-31 23:59:59.001000', offset='L'),
                               groupid
  1999-12-31 23:59:59.001041 -0.374915
  1999-12-31 23:59:59.001062 -1.470845),
 (Timestamp('1999-12-31 23:59:59.002000', offset='L'),
                               groupid
  1999-12-31 23:59:59.002355 -0.240954)]

Might be simpler just to resample. You can use a custom function for how.

In [44]: df.resample('ms',how='sum')
Out[44]: 
                          groupid
1999-12-31 23:59:59.000 -0.593454
1999-12-31 23:59:59.001 -1.845759
1999-12-31 23:59:59.002 -0.240954
1999-12-31 23:59:59.003  1.291403
1999-12-31 23:59:59.004       NaN
1999-12-31 23:59:59.005  0.291484
...                           ...
1999-12-31 23:59:59.994       NaN
1999-12-31 23:59:59.995       NaN
1999-12-31 23:59:59.996       NaN
1999-12-31 23:59:59.997 -0.445052
1999-12-31 23:59:59.998       NaN
1999-12-31 23:59:59.999 -0.895305

[1000 rows x 1 columns]

Upvotes: 2

Related Questions