I have the following code and the algorithm is pretty slow. I tried preallocation with numpy (1.14.3) and converting string columns in pandas (0.23.0) into category to speed up the code, but it is still very slow. My df is a large pd.DataFrame: len(df) returns 1342058 and it has 25 columns. df contains timestamps with certain events and locations. The frequency of the data is variable, ranging from per-second to hourly. I basically want to resample the data for each client_id over a given date range rng (every 5 minutes between start and end).
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['client_id'] = ['40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX','40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX','40:a3:cc:XX:XX:XX','28:c6:3f:XX:XX:XX','28:c6:3f:XX:XX:XX','40:a3:cc:XX:XX:XX']
df['seen_time'] = pd.to_datetime(['2018-09-01 00:00:03+00:00','2018-09-01 00:00:04+00:00','2018-09-01 00:00:05+00:00','2018-09-01 00:00:06+00:00','2018-09-01 00:00:08+00:00','2018-09-01 00:00:09+00:00','2018-09-01 00:00:09+00:00','2018-09-01 00:00:14+00:00','2018-09-01 00:00:19+00:00','2018-09-01 00:00:25+00:00'], utc=True)
df['location_x'] = [7.488775,20.163136,19.485196,12.841458,15.508627,5.708157,13.451071,19.187165,9.015443,28.266964]
Here is my code:
start = "2018-09-01 00:00"
end = "2018-09-28 00:00"
filter_min = 5  # window length in minutes
rng = pd.date_range(start=start, end=end, freq='5T')  # date range in 5-minute steps
df['client_id'] = df['client_id'].astype('category')
# --> this line already improved the speed of the algorithm by 20x

def processDataSingle(client_id):
    df_ = pd.DataFrame()                    # temporary dataframe
    data_ = np.zeros((len(rng),)).tolist()  # initialize
    dataSize = len(rng)
    for idx, d in enumerate(rng):
        t0 = d.tz_localize('UTC')
        t1 = (d + pd.to_timedelta(filter_min, unit='m')).tz_localize('UTC')
        # t0 and t1 are start and end time, with 5 min difference for the query
        df_client = df[df['client_id'] == client_id]
        # selects the data of the required client_id only
        df_client_range = df_client[(df_client['seen_time'] >= t0) & (df_client['seen_time'] < t1)]
        # selects only the data for this client and between t0 and t1
        content = {
            'x_mean': df_client_range['location_x'].mean(),
            'x_median': df_client_range['location_x'].median(),
            'x_min': df_client_range['location_x'].min(),
            'x_max': df_client_range['location_x'].max()
        }
        data_[idx] = content
    # end of looping over date range
    df_[client_id] = data_
    return df_

def processDataArrayWrapper(client_ids):
    results = np.zeros((len(client_ids),)).tolist()  # preallocate
    for iclient, client_id in enumerate(client_ids):
        results[iclient] = processDataSingle(client_id)
    # end looping over client_ids
    return results
# end of processDataArrayWrapper

client_ids = ['firstID','secondID','thirdID']  # in real world around 1000 IDs
%prun results = processDataArrayWrapper(client_ids)
The result of %prun is the following:
158001796 function calls (155889787 primitive calls) in 483.930 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
44934 209.957 0.005 209.957 0.005 {pandas._libs.algos.take_2d_axis1_object_object}
381942 44.206 0.000 44.379 0.000 {built-in method pandas._libs.algos.ensure_int64}
22468 20.656 0.001 95.913 0.004 categorical.py:2437(_recode_for_categories)
67402 20.583 0.000 20.583 0.000 {pandas._libs.algos.take_1d_int16_int16}
3 20.274 6.758 483.715 161.238 <ipython-input-49-c130ea2c69e4>:23(processDataSingle)
44934 9.177 0.000 9.177 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}
44934 9.107 0.000 9.107 0.000 {pandas._libs.algos.take_2d_axis1_float64_float64}
36873382 7.775 0.000 15.205 0.000 {built-in method builtins.isinstance}
44934 7.592 0.000 7.592 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
11899203 4.604 0.000 5.136 0.000 {built-in method builtins.getattr}
22471 4.368 0.000 4.450 0.000 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
44934 3.644 0.000 3.644 0.000 {pandas._libs.algos.take_1d_int64_int64}
22467 3.479 0.000 3.776 0.000 categorical.py:64(f)
643042 3.380 0.000 3.380 0.000 {method 'reduce' of 'numpy.ufunc' objects}
148422 3.016 0.000 3.016 0.000 {method 'copy' of 'numpy.ndarray' objects}
2557465/2534997 2.740 0.000 4.622 0.000 common.py:1835(_get_dtype_type)
337007/292073 2.727 0.000 318.981 0.001 algorithms.py:1545(take_nd)
7351676 2.672 0.000 5.713 0.000 generic.py:7(_check)
404409/359475 2.359 0.000 20.467 0.000 series.py:165(__init__)
8155845/6470735 2.152 0.000 2.873 0.000 {built-in method builtins.len}
471840 2.111 0.000 2.111 0.000 {built-in method numpy.core.multiarray.empty}
89874 2.109 0.000 4.487 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)
2610562 2.028 0.000 5.330 0.000 base.py:61(is_dtype)
1046990/1002055 1.888 0.000 2.184 0.000 {built-in method numpy.core.multiarray.array}
44935 1.830 0.000 1.830 0.000 {method 'take' of 'numpy.ndarray' objects}
4713590 1.757 0.000 1.773 0.000 {built-in method builtins.hasattr}
134826/89883 1.750 0.000 7.191 0.000 base.py:250(__new__)
247139 1.647 0.000 6.430 0.000 cast.py:257(maybe_promote)
1087307 1.523 0.000 5.693 0.000 common.py:1688(is_extension_array_dtype)
134802/89868 1.475 0.000 18.290 0.000 datetimes.py:329(__new__)
2233231 1.437 0.000 2.797 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
718950 1.411 0.000 2.764 0.000 generic.py:4374(__setattr__)
763890/629088 1.361 0.000 2.083 0.000 {method 'format' of 'str' objects}
224670 1.314 0.000 254.948 0.001 internals.py:1237(take_nd)
629082 1.249 0.000 2.836 0.000 internals.py:116(__init__)
314539 1.145 0.000 303.889 0.001 frame.py:2661(__getitem__)
...
67401 0.634 0.000 134.602 0.002 ops.py:1175(wrapper)
44934 0.625 0.000 277.776 0.006 internals.py:4518(take)
...
67404 0.558 0.000 101.225 0.002 categorical.py:267(__init__)
...
89868 0.527 0.000 11.864 0.000 datetimelike.py:523(take)
44934 0.522 0.000 285.513 0.006 generic.py:2780(_take)
1392955 0.517 0.000 0.599 0.000 internals.py:352(dtype)
44934/22467 0.514 0.000 12.605 0.001 {built-in method _operator.ge}
...
44937 0.425 0.000 264.034 0.006 internals.py:4423(<listcomp>)
...
763912 0.296 0.000 1.993 0.000 numeric.py:424(asarray)
44934 0.295 0.000 22.606 0.001 datetimes.py:109(wrapper)
381946 0.289 0.000 2.251 0.000 {method 'any' of 'numpy.ndarray' objects}
89892 0.286 0.000 0.960 0.000 common.py:298(_asarray_tuplesafe)
67404 0.281 0.000 1.299 0.000 cast.py:971(maybe_cast_to_datetime)
89868 0.280 0.000 2.962 0.000 internals.py:4108(get)
629082 0.276 0.000 0.276 0.000 internals.py:127(_check_ndim)
44934 0.276 0.000 295.073 0.007 frame.py:2704(_getitem_array)
107568 0.267 0.000 2.324 0.000 missing.py:494(na_value_for_dtype)
179742 0.264 0.000 0.827 0.000 dtypes.py:459(construct_from_string)
44937 0.260 0.000 270.290 0.006 internals.py:4388(reindex_indexer)
...
67401 0.220 0.000 128.302 0.002 ops.py:1073(dispatch_to_index_op)
...
1 0.194 0.194 483.930 483.930 <string>:1(<module>)
...
44934 0.179 0.000 20.583 0.000 base.py:89(cmp_method)
...
44934 0.110 0.000 10.841 0.000 datetimes.py:39(_maybe_cache)
...
1 0.003 0.003 483.719 483.719 <ipython-input-49-c130ea2c69e4>:61(processDataArrayWrapper)
...
1 0.000 0.000 483.930 483.930 {built-in method builtins.exec}
Note: I deleted a bunch of lines where cumtime was small. I believe most of the time is spent in take_2d_axis1_object_object, but I'm having difficulty understanding the problem. Which part of the code is causing it?
I'm running the code on my MacBook Pro with a 2.7 GHz Intel Core i7 (4 cores), and the Python process uses nearly 400% CPU, which means it somehow uses multiple cores under the hood.
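One thing I noticed while staring at the profile: df[df['client_id'] == client_id] re-slices the full 1.3M-row frame (including its object-dtype columns) on every one of the len(rng) iterations, which would be consistent with the huge number of take_nd / take_2d calls above. An untested sketch of hoisting that selection out of the loop (processDataSingleHoisted is just my placeholder name):

def processDataSingleHoisted(client_id):
    # hoisted: slice the big frame once per client instead of once per 5-minute window
    df_client = df[df['client_id'] == client_id]
    data_ = [None] * len(rng)
    for idx, d in enumerate(rng):
        t0 = d.tz_localize('UTC')
        t1 = t0 + pd.Timedelta(minutes=filter_min)
        df_client_range = df_client[(df_client['seen_time'] >= t0) & (df_client['seen_time'] < t1)]
        data_[idx] = {
            'x_mean': df_client_range['location_x'].mean(),
            'x_median': df_client_range['location_x'].median(),
            'x_min': df_client_range['location_x'].min(),
            'x_max': df_client_range['location_x'].max()
        }
    df_ = pd.DataFrame()
    df_[client_id] = data_
    return df_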
I was able to come up with a pretty cool solution that reduces the execution time from several hours to less than 15 seconds:
def resampleData():
    results = {}
    for idx, thisIDIndex in df.groupby('client_id').groups.items():
        df_id = df.loc[thisIDIndex]
        df_id_sel = df_id[['seen_time', 'location_x', 'location_y']]  # only the columns I really need
        df_id_sel = df_id_sel.set_index('seen_time')
        df_id_sel_resampled_mean = df_id_sel.resample('5T').mean()
        df_id_sel_resampled_max = df_id_sel.resample('5T').max()
        df_id_sel_resampled_min = df_id_sel.resample('5T').min()
        df_id_sel_resampled_median = df_id_sel.resample('5T').median()
        df_id_sel_resampled_mean = df_id_sel_resampled_mean.rename(columns={'location_x': 'x_mean', 'location_y': 'y_mean'})
        df_id_sel_resampled_max = df_id_sel_resampled_max.rename(columns={'location_x': 'x_max', 'location_y': 'y_max'})
        df_id_sel_resampled_min = df_id_sel_resampled_min.rename(columns={'location_x': 'x_min', 'location_y': 'y_min'})
        df_id_sel_resampled_median = df_id_sel_resampled_median.rename(columns={'location_x': 'x_med', 'location_y': 'y_med'})
        DF = pd.concat([df_id_sel_resampled_min, df_id_sel_resampled_max, df_id_sel_resampled_mean, df_id_sel_resampled_median], axis=1, sort=False)
        # print('--- Index ' + idx + ' done')
        results[idx] = DF
    # end of loop over ids
    return results
# end of resampleData
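For reference, this is how I call it; results is a plain dict keyed by client_id (the group keys from groupby), so with one of the client IDs from the sample data above:

results = resampleData()
results['40:a3:cc:XX:XX:XX'].head()  # one resampled DataFrame per client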
I'm now looping only over the IDs and then accessing each client's rows via groupby, as pointed out in the comments. I also use resample to do the job instead of iterating over the time axis. I could probably optimise the code further, but I'm happy with this solution: it is still very explicit and easy to understand and follow.
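If I ever need to squeeze out more, one candidate (a sketch I haven't benchmarked) would be to collapse the four resample passes into a single agg call; the result then has MultiIndex columns like ('location_x', 'mean') instead of the renamed flat ones:

stats = (df.set_index('seen_time')
           .groupby('client_id')[['location_x', 'location_y']]
           .resample('5T')
           .agg(['min', 'max', 'mean', 'median']))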