Reputation: 23
I have created a dataframe with a column for codes and another for discrete values. I have a numpy array of experimental values. How do I create a numpy array of codes, where each element is the code associated with the nearest value defined in the dataframe?
The mapping between codes and values doesn't have to live in a dataframe. I am much more familiar with pandas than numpy, so I tend to lean towards pandas dataframes; I am unfamiliar enough with numpy that I'm not sure what the best approach is.
This is what I have tried, and it gives the correct result. It's just too slow: my actual data set is 500x1500, and I have over 700 sets of data that this operation needs to run over, so efficiency and speed are paramount. Ideas? Thoughts? Suggestions? Thanks!
import numpy as np
import pandas as pd

def main():
    npsize = (2, 4)
    # Create an array of data between -0.75 and 0.25
    data = np.random.uniform(-0.75, 0.25, npsize)

    # Pandas dataframe that creates a map
    codes = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]),
             np.array([10, 11, 12]), np.array([13, 14, 15])]
    values = [-0.75, -0.5, -0.25, 0, 0.25]
    d = {'code': codes, 'value': values}
    data_map = pd.DataFrame(data=d)

    # I need to associate each element of data to the code within the
    # data_map dataframe by looking up the nearest value.
    # For example ... -0.05 ---> [10, 11, 12]

    # Silly looping approach ... surely there is a better/faster way to do this!
    mapped_data = np.zeros(shape=(2, 4, 3))
    xctr = 0
    yctr = 0
    while xctr < npsize[0]:
        while yctr < npsize[1]:
            nearest_code = data_map.iloc[(data_map['value'] - data[xctr, yctr]).abs().argsort()[:1]].code.iloc[0]
            mapped_data[xctr, yctr] = nearest_code
            yctr = yctr + 1
        yctr = 0
        xctr = xctr + 1
    print(mapped_data)

if __name__ == "__main__":
    main()
Upvotes: 0
Views: 89
Reputation: 15273
Don't create codes as a list of arrays of hand-written values; use a reshaped arange. Don't create values by hand either; also use arange.
Avoid holding inner lists: expand your "code" into multiple columns, and then into an additional dimension in your output array.
And yes, don't loop. Numpy doesn't have anything built in for this kind of merge, but Pandas does: merge_asof. It has sorting requirements, and if you can avoid needing to preserve the original order afterwards, your code will be faster.
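Before the full solution, here is a minimal sketch of the core call, with a hypothetical toy map (the names below are illustrative, not from your data): merge_asof with direction='nearest' snaps each key in the sorted left side to the closest key on the right, which is exactly the nearest-value lookup you are doing by hand.

```python
import pandas as pd

# Toy map: the index holds the reference values, the column holds a code
data_map = pd.DataFrame({'code': [1, 2, 3]}, index=[-0.5, 0.0, 0.5])

# The left side must already be sorted on the merge key
queries = pd.Series([-0.4, 0.1, 0.45], name='q')

merged = pd.merge_asof(
    left=queries, right=data_map,
    left_on='q', right_index=True, direction='nearest',
)
print(merged['code'].tolist())  # each query snaps to the nearest index value: [1, 2, 3]
```

Note that merge_asof accepts a named Series for either side; the result is a DataFrame with the query column plus the matched map columns.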
import numpy as np
import pandas as pd
from numpy.random import default_rng

def main() -> None:
    # Pandas dataframe that creates a map
    code_width = 3
    codes = np.arange(1, 16).reshape((-1, code_width))
    d = {
        f'code_{i}': col
        for i, col in enumerate(codes.T)
    }
    data_map = pd.DataFrame(d, index=np.arange(-0.75, 0.5, 0.25))

    rand = default_rng(seed=0)
    givens = rand.uniform(-0.75, 0.25, (2, 4))
    givens_flat = givens.ravel()

    # Givens need to be sorted. If you don't need to preserve the original
    # order, then replace argsort with sort.
    order = givens_flat.argsort()
    givens_flat = givens_flat[order]

    merged_givens = pd.merge_asof(
        left=pd.Series(givens_flat, name='givens'), right=data_map,
        left_on='givens', right_index=True, direction='nearest',
    )

    mapped_givens = np.empty((code_width, len(givens_flat)))
    mapped_givens[:, order] = merged_givens.iloc[:, 1:].T
    mapped_givens = mapped_givens.reshape((-1, *givens.shape))

    print('Map:')
    print(data_map)
    print()
    print('Givens:')
    print(givens)
    print()
    print('Mapped:')
    print(mapped_givens)

if __name__ == "__main__":
    main()
Map:
code_0 code_1 code_2
-0.75 1 2 3
-0.50 4 5 6
-0.25 7 8 9
0.00 10 11 12
0.25 13 14 15
Givens:
[[-0.11303831 -0.48021329 -0.70902648 -0.73347236]
[ 0.06327024 0.16275558 -0.14336422 -0.02050344]]
Mapped:
[[[10. 4. 1. 1.]
[10. 13. 7. 10.]]
[[11. 5. 2. 2.]
[11. 14. 8. 11.]]
[[12. 6. 3. 3.]
[12. 15. 9. 12.]]]
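For comparison (not part of the merge_asof approach above), since the map here has only a handful of reference values, a pure-numpy sketch is also possible: broadcast the absolute differences against all reference values and take argmin. This is O(n_data × n_values) rather than a sorted merge, so it only pays off when the map is small, as it is here.

```python
import numpy as np

values = np.arange(-0.75, 0.5, 0.25)       # reference values, shape (5,)
codes = np.arange(1, 16).reshape((-1, 3))  # one row of codes per value, shape (5, 3)

rng = np.random.default_rng(seed=0)
givens = rng.uniform(-0.75, 0.25, (2, 4))

# Broadcast |given - value| to shape (2, 4, 5), then pick the nearest value's row index
nearest = np.abs(givens[..., None] - values).argmin(axis=-1)  # shape (2, 4)
mapped = codes[nearest]                                       # shape (2, 4, 3)
print(mapped.shape)  # (2, 4, 3)
```

This lays the code dimension last, so mapped.transpose(2, 0, 1) matches the (code, row, column) layout printed above.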
Upvotes: 1