D500

Reputation: 442

Calculate Manhattan distance and return the category with the lowest distance

I'm looking to create a function that calculates the Manhattan distance between a selected category and all the other categories in a dataset. The function should then return the CATEGORY with the lowest distance from the selected.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
df['category'] = ['apple','orange','grape','berry','strawberry','banana','kiwi','lemon','lime','pear']

The code below returns the 4 smallest distances, which includes the selected category itself (distance = 0, which is redundant and not needed). I need the code to return only the 3 lowest distances as a list of categories, with the closest first.

def distance(row):
    cols = list('ABCD')
    return (df[cols] - row[cols]).abs().sum(axis=1)

df.set_index('category', inplace=True)
dist = df.apply(distance, axis=1)

dist['apple'].nsmallest(4)

For instance, if "Apple" was selected, and the three lowest distances from Apple were Berry, Orange and Grape, return should look like this: ["Berry", "Orange","Grape"]

Upvotes: 3

Views: 277

Answers (2)

Mabel Villalba

Reputation: 2598

One option is to use the function cityblock from scipy.spatial.distance:

from scipy.spatial import distance

df.set_index('category', inplace = True)

df.apply(lambda x: distance.cityblock(x, df.loc['apple', :]), axis=1
        ).drop('apple').nsmallest(3).index.values.tolist()

 ['strawberry', 'berry', 'kiwi']

Basically, you get the distance from each row to the one selected. Then you drop the row containing the selected label and pick the index of the smallest distances.
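The steps above can be wrapped in a small helper. This is only a sketch: the function name closest_categories and the tiny demo frame are made up for illustration, and it assumes the category labels are already the index.

```python
import pandas as pd
from scipy.spatial import distance


def closest_categories(df, selected, n=3):
    """Return the n index labels closest to `selected` by Manhattan distance.

    Hypothetical helper; assumes `df` is indexed by category and has only
    numeric columns.
    """
    target = df.loc[selected]
    # Distance from every row to the selected row
    dists = df.apply(lambda row: distance.cityblock(row, target), axis=1)
    # Drop the selected label itself (distance 0), then keep the n smallest
    return dists.drop(selected).nsmallest(n).index.tolist()


# Small fixed example so the result does not depend on random data
demo = pd.DataFrame(
    {"A": [0, 1, 5, 10], "B": [0, 1, 5, 10]},
    index=["apple", "orange", "grape", "berry"],
)
print(closest_categories(demo, "apple", n=2))  # ['orange', 'grape']
```

nsmallest already returns the values in ascending order, so the first label in the list is the closest category.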

Upvotes: 0

Brian

Reputation: 1604

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
df['category'] = ['apple','orange','grape','berry','strawberry','banana','kiwi','lemon','lime','pear']
df.set_index('category', inplace=True)

It's a mouthful but:

lowest_3 = [df.index[pd.Series([abs(df.loc[ind1] - df.loc[ind2]).sum() for ind2 in df.index]).argsort()[1:4]].tolist() for ind1 in df.index]

lowest_3_series = pd.Series(lowest_3, index = df.index)

lowest_3_series['apple']  # e.g. ['banana', 'lemon', 'grape']; results will differ because the data is random

This will get you a list of the lowest 3 values for each value in df.index.

For instance, the first element of this list is your solution for 'apple'

Explanation:

The outer list comprehension iterates over every label in df.index, and the inner one iterates over df.index again, so every pair of rows is compared (n^2 comparisons in total). For each pair, you take the absolute difference between the column values and sum it, which is the Manhattan distance. The inner list is then turned into a Series, and argsort returns the positions ordered by distance; the slice [1:4] keeps the three closest while skipping position 0, the reflexive comparison, whose distance is always 0. Finally, indexing df.index with those positions gives the names of the three closest categories.
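The same pairwise idea can also be written with NumPy broadcasting instead of nested comprehensions. This is a rough sketch, not the code from the answer: the function name lowest_n_per_category is hypothetical, and it masks the diagonal with infinity rather than slicing off the first sorted position, which should also behave sensibly if two rows happen to be identical.

```python
import numpy as np
import pandas as pd


def lowest_n_per_category(df, n=3):
    """Map each index label to its n nearest labels by Manhattan distance.

    Hypothetical helper; assumes `df` is indexed by category and has only
    numeric columns.
    """
    values = df.to_numpy()
    # |x_i - x_j| summed over columns gives an (n_rows, n_rows) distance matrix
    pairwise = np.abs(values[:, None, :] - values[None, :, :]).sum(axis=2)
    pairwise = pairwise.astype(float)
    # Put infinity on the diagonal so a row is never its own neighbour
    np.fill_diagonal(pairwise, np.inf)
    # Sort each row's distances and keep the positions of the n smallest
    nearest = np.argsort(pairwise, axis=1, kind="stable")[:, :n]
    return pd.Series([df.index[row].tolist() for row in nearest], index=df.index)


# Small fixed example so the result does not depend on random data
demo = pd.DataFrame(
    {"A": [0, 1, 5, 10], "B": [0, 1, 5, 10]},
    index=["apple", "orange", "grape", "berry"],
)
result = lowest_n_per_category(demo, n=2)
print(result["apple"])  # ['orange', 'grape']
```

The intermediate array has shape (rows, rows, columns), so this trades memory for speed and is best suited to datasets with a modest number of rows.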

Upvotes: 1
