I have the following code and the algorithm is pretty slow. I tried preallocation with numpy (1.14.3) and converting string columns in pandas (0.23.0) into category to speed up the code, but it is still very slow. My df is a large pd.DataFrame: len(df) returns 1342058 and it has 25 columns. df contains timestamps with certain events and locations. The frequency of the data is variable, ranging from per-second to hourly. I basically want to resample the data for each client_id over a given date range rng (every 5 minutes between start and end).
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client_id'] = ['40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX','40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX','40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX']
df['seen_time'] = pd.to_datetime(['2018-09-01 00:00:03+00:00','2018-09-01 00:00:04+00:00','2018-09-01 00:00:05+00:00','2018-09-01 00:00:06+00:00','2018-09-01 00:00:08+00:00','2018-09-01 00:00:09+00:00','2018-09-01 00:00:09+00:00','2018-09-01 00:00:14+00:00','2018-09-01 00:00:19+00:00','2018-09-01 00:00:25+00:00'], utc=True)
df['location_x'] = [7.488775,20.163136,19.485196,12.841458,15.508627,5.708157,13.451071,19.187165,9.015443,28.266964]
Here is my code:
start = "2018-09-01 00:00"
end = "2018-09-28 00:00"
filter_min = 5  # window length in minutes
rng = pd.date_range(start=start, end=end, freq='5T')  # date range in 5-minute steps
df['client_id'] = df['client_id'].astype('category')
# --> this line already improved the speed of the algorithm by 20x

def processDataSingle(client_id):
    df_ = pd.DataFrame()                    # temporary dataframe
    data_ = np.zeros((len(rng),)).tolist()  # initialize
    dataSize = len(rng)
    for idx, d in enumerate(rng):
        t0 = d.tz_localize('UTC')
        t1 = (d + pd.to_timedelta(filter_min, unit='m')).tz_localize('UTC')
        # t0 and t1 are start and end time, with 5 min difference for the query
        df_client = df[df['client_id'] == client_id]
        # selects the data of the required client_id only
        df_client_range = df_client[(df_client['seen_time'] >= t0) & (df_client['seen_time'] < t1)]
        # selects only the data for this client and between t0 and t1
        content = {
            'x_mean': df_client_range['location_x'].mean(),
            'x_median': df_client_range['location_x'].median(),
            'x_min': df_client_range['location_x'].min(),
            'x_max': df_client_range['location_x'].max()
        }
        data_[idx] = content
    # end of looping over date range
    df_[client_id] = data_
    return df_

def processDataArrayWrapper(client_ids):
    results = np.zeros((len(client_ids),)).tolist()  # preallocate
    for iclient, client_id in enumerate(client_ids):
        results[iclient] = processDataSingle(client_id)
    # end looping over client_ids
    return results
# end of processDataArrayWrapper

client_ids = ['firstID','secondID','thirdID']  # in real world around 1000 IDs
%prun results = processDataArrayWrapper(client_ids)
The result of %prun is the following:
158001796 function calls (155889787 primitive calls) in 483.930 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
44934 209.957 0.005 209.957 0.005 {pandas._libs.algos.take_2d_axis1_object_object}
381942 44.206 0.000 44.379 0.000 {built-in method pandas._libs.algos.ensure_int64}
22468 20.656 0.001 95.913 0.004 categorical.py:2437(_recode_for_categories)
67402 20.583 0.000 20.583 0.000 {pandas._libs.algos.take_1d_int16_int16}
3 20.274 6.758 483.715 161.238 <ipython-input-49-c130ea2c69e4>:23(processDataSingle)
44934 9.177 0.000 9.177 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}
44934 9.107 0.000 9.107 0.000 {pandas._libs.algos.take_2d_axis1_float64_float64}
36873382 7.775 0.000 15.205 0.000 {built-in method builtins.isinstance}
44934 7.592 0.000 7.592 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
11899203 4.604 0.000 5.136 0.000 {built-in method builtins.getattr}
22471 4.368 0.000 4.450 0.000 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
44934 3.644 0.000 3.644 0.000 {pandas._libs.algos.take_1d_int64_int64}
22467 3.479 0.000 3.776 0.000 categorical.py:64(f)
643042 3.380 0.000 3.380 0.000 {method 'reduce' of 'numpy.ufunc' objects}
148422 3.016 0.000 3.016 0.000 {method 'copy' of 'numpy.ndarray' objects}
2557465/2534997 2.740 0.000 4.622 0.000 common.py:1835(_get_dtype_type)
337007/292073 2.727 0.000 318.981 0.001 algorithms.py:1545(take_nd)
7351676 2.672 0.000 5.713 0.000 generic.py:7(_check)
404409/359475 2.359 0.000 20.467 0.000 series.py:165(__init__)
8155845/6470735 2.152 0.000 2.873 0.000 {built-in method builtins.len}
471840 2.111 0.000 2.111 0.000 {built-in method numpy.core.multiarray.empty}
89874 2.109 0.000 4.487 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)
2610562 2.028 0.000 5.330 0.000 base.py:61(is_dtype)
1046990/1002055 1.888 0.000 2.184 0.000 {built-in method numpy.core.multiarray.array}
44935 1.830 0.000 1.830 0.000 {method 'take' of 'numpy.ndarray' objects}
4713590 1.757 0.000 1.773 0.000 {built-in method builtins.hasattr}
134826/89883 1.750 0.000 7.191 0.000 base.py:250(__new__)
247139 1.647 0.000 6.430 0.000 cast.py:257(maybe_promote)
1087307 1.523 0.000 5.693 0.000 common.py:1688(is_extension_array_dtype)
134802/89868 1.475 0.000 18.290 0.000 datetimes.py:329(__new__)
2233231 1.437 0.000 2.797 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
718950 1.411 0.000 2.764 0.000 generic.py:4374(__setattr__)
763890/629088 1.361 0.000 2.083 0.000 {method 'format' of 'str' objects}
224670 1.314 0.000 254.948 0.001 internals.py:1237(take_nd)
629082 1.249 0.000 2.836 0.000 internals.py:116(__init__)
314539 1.145 0.000 303.889 0.001 frame.py:2661(__getitem__)
...
67401 0.634 0.000 134.602 0.002 ops.py:1175(wrapper)
44934 0.625 0.000 277.776 0.006 internals.py:4518(take)
...
67404 0.558 0.000 101.225 0.002 categorical.py:267(__init__)
...
89868 0.527 0.000 11.864 0.000 datetimelike.py:523(take)
44934 0.522 0.000 285.513 0.006 generic.py:2780(_take)
1392955 0.517 0.000 0.599 0.000 internals.py:352(dtype)
44934/22467 0.514 0.000 12.605 0.001 {built-in method _operator.ge}
...
44937 0.425 0.000 264.034 0.006 internals.py:4423(<listcomp>)
...
763912 0.296 0.000 1.993 0.000 numeric.py:424(asarray)
44934 0.295 0.000 22.606 0.001 datetimes.py:109(wrapper)
381946 0.289 0.000 2.251 0.000 {method 'any' of 'numpy.ndarray' objects}
89892 0.286 0.000 0.960 0.000 common.py:298(_asarray_tuplesafe)
67404 0.281 0.000 1.299 0.000 cast.py:971(maybe_cast_to_datetime)
89868 0.280 0.000 2.962 0.000 internals.py:4108(get)
629082 0.276 0.000 0.276 0.000 internals.py:127(_check_ndim)
44934 0.276 0.000 295.073 0.007 frame.py:2704(_getitem_array)
107568 0.267 0.000 2.324 0.000 missing.py:494(na_value_for_dtype)
179742 0.264 0.000 0.827 0.000 dtypes.py:459(construct_from_string)
44937 0.260 0.000 270.290 0.006 internals.py:4388(reindex_indexer)
...
67401 0.220 0.000 128.302 0.002 ops.py:1073(dispatch_to_index_op)
...
1 0.194 0.194 483.930 483.930 <string>:1(<module>)
...
44934 0.179 0.000 20.583 0.000 base.py:89(cmp_method)
...
44934 0.110 0.000 10.841 0.000 datetimes.py:39(_maybe_cache)
...
1 0.003 0.003 483.719 483.719 <ipython-input-49-c130ea2c69e4>:61(processDataArrayWrapper)
...
1 0.000 0.000 483.930 483.930 {built-in method builtins.exec}
Note: I deleted a bunch of lines where cumtime was small. I believe most of the time is spent in take_2d_axis1_object_object, but I'm having difficulty understanding the problem. Which part of the code is causing it?
I'm running the code on my MacBook Pro with a 2.7 GHz Intel Core i7 (4 cores), and the Python process uses nearly 400% CPU, which means it somehow uses multiple cores under the hood.
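One thing I noticed while staring at the profile: df[df['client_id'] == client_id] re-slices the full 1.3M-row frame (including its object-dtype columns) on every one of the len(rng) iterations, which would be consistent with the huge number of take_nd / take_2d calls above. An untested sketch of hoisting that selection out of the loop (processDataSingleHoisted is just my placeholder name):

def processDataSingleHoisted(client_id):
    # hoisted: slice the big frame once per client instead of once per 5-minute window
    df_client = df[df['client_id'] == client_id]
    data_ = [None] * len(rng)
    for idx, d in enumerate(rng):
        t0 = d.tz_localize('UTC')
        t1 = t0 + pd.Timedelta(minutes=filter_min)
        df_client_range = df_client[(df_client['seen_time'] >= t0) & (df_client['seen_time'] < t1)]
        data_[idx] = {
            'x_mean': df_client_range['location_x'].mean(),
            'x_median': df_client_range['location_x'].median(),
            'x_min': df_client_range['location_x'].min(),
            'x_max': df_client_range['location_x'].max()
        }
    df_ = pd.DataFrame()
    df_[client_id] = data_
    return df_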
I was able to come up with a pretty cool solution that reduces the execution time from several hours to less than 15 seconds:
def resampleData():
    results = {}
    for idx, thisIDIndex in df.groupby('client_id').groups.items():
        df_id = df.loc[thisIDIndex]
        df_id_sel = df_id[['seen_time', 'location_x', 'location_y']]  # only the columns I really need
        df_id_sel = df_id_sel.set_index('seen_time')
        df_id_sel_resampled_mean = df_id_sel.resample('5T').mean()
        df_id_sel_resampled_max = df_id_sel.resample('5T').max()
        df_id_sel_resampled_min = df_id_sel.resample('5T').min()
        df_id_sel_resampled_median = df_id_sel.resample('5T').median()
        df_id_sel_resampled_mean = df_id_sel_resampled_mean.rename(columns={'location_x': 'x_mean', 'location_y': 'y_mean'})
        df_id_sel_resampled_max = df_id_sel_resampled_max.rename(columns={'location_x': 'x_max', 'location_y': 'y_max'})
        df_id_sel_resampled_min = df_id_sel_resampled_min.rename(columns={'location_x': 'x_min', 'location_y': 'y_min'})
        df_id_sel_resampled_median = df_id_sel_resampled_median.rename(columns={'location_x': 'x_med', 'location_y': 'y_med'})
        DF = pd.concat([df_id_sel_resampled_min, df_id_sel_resampled_max, df_id_sel_resampled_mean, df_id_sel_resampled_median], axis=1, sort=False)
        # print('--- Index ' + idx + ' done')
        results[idx] = DF
    # end of loop over ids
    return results
# end of resampleData
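For reference, this is how I call it; results is a plain dict keyed by client_id (the group keys from groupby), so with one of the client IDs from the sample data above:

results = resampleData()
results['40:a3:cc:XX:XX:XX'].head()  # one resampled DataFrame per client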
I'm now looping only over the IDs and then accessing each client's rows via groupby, as pointed out in the comments. I also use resample to do the job instead of iterating over the time axis. I could probably optimise the code further, but I'm happy with this solution: it is still very explicit and easy to understand and follow.
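If I ever need to squeeze out more, one candidate (a sketch I haven't benchmarked) would be to collapse the four resample passes into a single agg call; the result then has MultiIndex columns like ('location_x', 'mean') instead of the renamed flat ones:

stats = (df.set_index('seen_time')
           .groupby('client_id')[['location_x', 'location_y']]
           .resample('5T')
           .agg(['min', 'max', 'mean', 'median']))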