colinfang

Reputation: 21767

How to efficiently join a list of values to a list of intervals?

I have a data frame which can be constructed as follows:

import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({'value': scipy.stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(scipy.stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(scipy.stats.norm.rvs(5, 5, size=1000))
df[:10]

       start       value        end
0    9.521781   -0.570097    17.708335
1    3.929711   -0.927318    15.065047
2    3.990466    0.756413    4.841934
3    20.676291  -1.418172    28.284301
4    13.084246   1.280723    14.121626
5    29.784740   0.236915    32.791751
6    21.626625   1.144663    28.739413
7    18.524309   0.101871    27.271344
8    21.288152  -0.727120    27.049582
9    13.556664   0.713141    22.136275

Each row represents a value assigned to the interval (start, end).

Now I would like to get, for each of the times 10, 12, 14, ..., 70, the best (i.e. maximum) value among all intervals covering that time. For example, at t = 10 the covering intervals in the sample above are rows 0 and 1, so the result is max(-0.570097, -0.927318) = -0.570097. (This is similar to a geometric index in SQL, if you are familiar with that.)
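For a single time point, the query can be expressed directly in pandas; here is a minimal sketch (the names t, covering and best_at_t are mine, just for illustration):

t = 10
# rows whose interval [start, end] contains t
covering = df[(df['start'] <= t) & (df['end'] >= t)]
best_at_t = covering['value'].max()  # NaN if no interval covers t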

Below is my first attempt in Python with pandas; it takes 18.5 ms. Can anyone help improve it? (This procedure would be called 1M or more times with different data frames in my program.)

def get_values(data):
    # sort by value descending so the first match is always the max; this takes 0.2ms
    data.sort_values(by='value', ascending=False, inplace=True)
    # can we get rid of it? since we don't really need a full sort...
    # all we need is the max value for each interval.
    # But if we have to keep it for simplicity it is ok.
    ret = []
    #data = data[(data['end'] >= 10) & (data['start'] <= 71)]
    for t in range(10, 71, 2):
        interval = data[(data['end'] >= t) & (data['start'] <= t)]
        if not interval.empty:
            # first row is the max value among intervals covering t
            ret.append(interval['value'].values[0])
        else:
            # no interval covers t: pad the rest of the grid with None and stop
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret
#%prun -l 10 print get_values(df)
%timeit get_values(df)

The second attempt pushes as much of the work as possible out of pandas and into NumPy, and takes around 0.7 ms:

def get_values(data):
    data.sort_values(by='value', ascending=False, inplace=True)
    ret = []
    # pull the columns out as plain NumPy arrays once, outside the loop
    df_end = data['end'].values
    df_start = data['start'].values
    df_value = data['value'].values
    for t in range(10, 71, 2):
        values = df_value[(df_end >= t) & (df_start <= t)]
        if len(values) != 0:
            ret.append(values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret
#%prun -l 10 print get_values(df)
%timeit get_values(df)

Can we improve this further? I guess the next step is at the algorithm level; both of the above are just naive implementations of the logic.
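One algorithm-level idea is to invert the loops: instead of scanning all intervals for every time point, sweep over the intervals once and let each interval update the running maximum of the grid points it covers. Below is a sketch of that idea, not a drop-in replacement: the name get_values_sweep is mine, and it fills uncovered grid points with NaN rather than stopping at the first gap as the code above does.

import numpy as np

def get_values_sweep(data, t0=10, t1=70, step=2):
    n_grid = (t1 - t0) // step + 1
    best = np.full(n_grid, -np.inf)
    start = data['start'].values
    end = data['end'].values
    value = data['value'].values
    # index range of grid points covered by each interval, clipped to the grid
    lo = np.clip(np.ceil((start - t0) / step).astype(int), 0, n_grid)
    hi = np.clip(np.floor((end - t0) / step).astype(int) + 1, 0, n_grid)
    for i in range(len(value)):  # a single pass over the intervals
        if lo[i] < hi[i]:
            seg = best[lo[i]:hi[i]]            # view into the result array
            np.maximum(seg, value[i], out=seg)
    best[np.isinf(best)] = np.nan              # grid points covered by nothing
    return best

This does O(n + total coverage) work instead of O(n × grid size), so it should matter most when the grid is much denser than the 31 points used here.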

Upvotes: 1

Views: 106

Answers (1)

HYRY

Reputation: 97331

I don't understand the empty handling in your code; here is a faster version that ignores it:

import scipy.stats as stats
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(stats.norm.rvs(5, 5, size=1000))

def get_value(df, target):
    value = df["value"].values
    # sort all three columns by value, descending
    idx = np.argsort(value)[::-1]
    start = df["start"].values[idx]
    end = df["end"].values[idx]
    value = value[idx]

    # mask[i, j] is True if interval j covers target time i
    mask = (target[:, None] >= start[None, :]) & (target[:, None] <= end[None, :])
    # first True along each row = the covering interval with the largest value
    index = np.argmax(mask, axis=1)
    # argmax returns 0 on an all-False row, so check the hit is real
    flags = mask[np.arange(len(target)), index]
    result = value[index]
    result[~flags] = np.nan  # no interval covers this target time
    return result

get_value(df, np.arange(10, 71, 2))
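A quick way to sanity-check this against the second version from the question (a sketch; the NaN padding is only needed because the loop version stops early at the first uncovered time point):

target = np.arange(10, 71, 2)
fast = get_value(df, target)

slow = get_values(df)  # second version from the question
# replace None with NaN and pad the early-exit tail so the lengths match
slow = np.array([np.nan if v is None else v for v in slow]
                + [np.nan] * (len(target) - len(slow)), dtype=float)

# compare only where the loop produced a value; the tails can differ
valid = ~np.isnan(slow)
assert np.allclose(fast[valid], slow[valid])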

Upvotes: 2
