Stat-R

Reputation: 5270

Speed up distance and summary computation between two HUGE multi-dimensional arrays in Python

I have only a year of experience with Python. I would like to compute summary statistics from two multi-dimensional arrays, DF_All and DF_On, both of which have X and Y values. The function below computes the distance as sqrt((X-X0)^2 + (Y-Y0)^2) and generates summaries as shown in the code. My question is: is there any way to make this code run faster? I would prefer a native Python/NumPy method, but other strategies (like numba) are also welcome.

The example (toy) code below takes only about 50 milliseconds to run on my Windows 7 x64 desktop. But my real DF_All has more than 10,000 rows, and I need to repeat this calculation a huge number of times, resulting in a huge total execution time.

import numpy as np
import pandas as pd
import random

# create data
KY = ['ER','WD','DF']
DS = ['On','Off']

DF_All = pd.DataFrame({'KY': np.random.choice(KY,20,replace = True),
                       'DS': np.random.choice(DS,20,replace = True),
                       'X': random.sample(range(1,100),20),
                       'Y': random.sample(range(1,100),20)})


DF_On = DF_All[DF_All['DS']=='On']

# function 
def get_values(DF_All, X=list(DF_On['X'])[0], Y=list(DF_On['Y'])[0]):
    dist_vector = np.sqrt((DF_All['X'] - X)**2 + (DF_All['Y'] - Y)**2)  # Euclidean distance to (X, Y)

    DF_All = DF_All[dist_vector < 35]  # keep only rows closer than 35

    DS_summary = [sum(DF_All['DS'] == x) for x in ['On', 'Off']]  # counts per DS label
    KY_summary = [sum(DF_All['KY'] == x) for x in ['ER', 'WD', 'DF']]  # counts per KY label

    joined_summary = DS_summary + KY_summary  # join the two summary lists
    return joined_summary

Array_On = DF_On.values.tolist()  # convert DataFrame rows to a list of lists
Values = [get_values(DF_All, ZZ[2], ZZ[3]) for ZZ in Array_On]  # DS and KY summaries for every row of Array_On

Array_Updated = [x + y for x,y in zip(Array_On,Values)] # appending the summary list to Array_On list
Array_Updated = pd.DataFrame(Array_Updated) # converting to pandas dataframe 
print(Array_Updated) 

Upvotes: 1

Views: 59

Answers (1)

Divakar

Reputation: 221654

Here's an approach that makes use of vectorization and gets rid of the looping -

from scipy.spatial.distance import cdist

def get_values_vectorized(DF_All, Array_On):
    a = DF_All[['X','Y']].values              # all (X, Y) coordinates
    b = np.array(Array_On)[:,2:].astype(int)  # query (X, Y) coordinates
    v_mask = (cdist(b,a) < 35).astype(int)    # pairwise distance-threshold mask

    DF_DS = DF_All.DS.values
    DS_sums = v_mask.dot(DF_DS[:,None] == ['On','Off'])      # per-query DS counts

    DF_KY = DF_All.KY.values
    KY_sums = v_mask.dot(DF_KY[:,None] == ['ER','WD','DF'])  # per-query KY counts
    return np.column_stack(( DS_sums, KY_sums ))
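The `dot` step works because comparing a column of labels against a list broadcasts into a boolean one-hot matrix, so the matrix product with the mask counts each label per query row. A minimal sketch (the toy labels and mask below are made up for illustration):

```python
import numpy as np

labels = np.array(['On', 'Off', 'On'])       # toy label column
one_hot = labels[:, None] == ['On', 'Off']   # broadcast -> (3, 2) boolean one-hot matrix
mask = np.array([[True, True, False]])       # one query row keeping the first two points
counts = mask.astype(int).dot(one_hot)       # matrix product counts each label per query
print(counts)                                # [[1 1]] -> one 'On', one 'Off' within range
```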

Here's a tweaked version that uses a bit less memory -

def get_values_vectorized_v2(DF_All, Array_On):
    a = DF_All[['X','Y']].values
    b = np.array(Array_On)[:,2:].astype(int)
    v_mask = cdist(a,b) < 35

    DF_DS = DF_All.DS.values
    DS_sums = [((DF_DS==x)[:,None] & v_mask).sum(0) for x in  ['On','Off']]

    DF_KY = DF_All.KY.values
    KY_sums = [((DF_KY==x)[:,None] & v_mask).sum(0) for x in  ['ER','WD','DF']]

    out = np.column_stack(( np.column_stack(DS_sums), np.column_stack(KY_sums)))
    return out
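Both versions rely on SciPy only for `cdist`; if SciPy is not available, the same pairwise distance matrix can be built with plain NumPy broadcasting. A minimal sketch (the toy coordinates below are made up):

```python
import numpy as np

a = np.array([[1.0, 2.0], [4.0, 6.0]])  # all points as (X, Y) rows
b = np.array([[1.0, 2.0]])              # query points as (X, Y) rows

# pairwise Euclidean distances, equivalent to cdist(b, a)
diff = b[:, None, :] - a[None, :, :]    # shape (len(b), len(a), 2)
dist = np.sqrt((diff ** 2).sum(-1))
print(dist)                             # [[0. 5.]]
```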

Runtime test -

Case #1: Original sample size of 20

In [417]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
100 loops, best of 3: 16.3 ms per loop

In [418]: %timeit get_values_vectorized(DF_All, Array_On)
1000 loops, best of 3: 386 µs per loop

Case #2: Sample size of 2000

In [420]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
1 loops, best of 3: 1.39 s per loop

In [421]: %timeit get_values_vectorized(DF_All, Array_On)
100 loops, best of 3: 18 ms per loop

Upvotes: 1
