Reputation: 442
I'm looking to create a function that calculates the Manhattan distance between a selected category and all the other categories in a dataset. The function should then return the CATEGORY with the lowest distance to the selected one.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
df['category'] = ['apple','orange','grape','berry','strawberry','banana','kiwi','lemon','lime','pear']
The code below returns the smallest 4 distances, which include the selected category itself (distance = 0, which is redundant and not needed). I need the code to return only the lowest 3 distances as a list of categories, the first being the smallest.
def distance(row):
    cols = list('ABCD')
    return (df[cols] - row[cols]).abs().sum(axis=1)
df.set_index('category', inplace=True)
dist = df.apply(distance, axis=1)
dist['apple'].nsmallest(4)
For instance, if "Apple" was selected, and the three lowest distances from Apple were Berry, Orange and Grape, return should look like this: ["Berry", "Orange","Grape"]
Upvotes: 3
Views: 277
Reputation: 2598
One option is to use the function cityblock from scipy.spatial.distance:
from scipy.spatial import distance
df.set_index('category', inplace = True)
>>> df.apply(lambda x: distance.cityblock(x, df.loc['apple', :]), axis=1
... ).drop('apple').nsmallest(4).index.values.tolist()
['strawberry', 'berry', 'kiwi', 'orange']
Basically, you get the distance from each row to the one selected. Then you drop the row containing the selected label and pick the index of the smallest distances.
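The steps described above (distance from each row to the selected one, drop the selected label, keep the smallest) can be wrapped in a small function. This is only a minimal sketch: the name nearest_categories and the example frame are hypothetical, and it uses plain pandas arithmetic in place of cityblock, which should give the same sum of absolute differences. It assumes the category labels are already the index and that all remaining columns are numeric.

```python
import pandas as pd

def nearest_categories(df, selected, n=3):
    """Return the n index labels closest to `selected` by Manhattan distance."""
    # Sum of absolute column differences between every row and the selected row.
    dist = (df - df.loc[selected]).abs().sum(axis=1)
    # Remove the selected label itself (its distance is always 0), then rank.
    return dist.drop(selected).nsmallest(n).index.tolist()

# Small fixed example (hypothetical data) so the result is reproducible.
example = pd.DataFrame(
    {"A": [0, 1, 5, 10, 3], "B": [0, 1, 5, 10, 2]},
    index=["apple", "orange", "grape", "berry", "kiwi"],
)
example.index.name = "category"

result = nearest_categories(example, "apple")
print(result)  # ['orange', 'kiwi', 'grape']
```

Passing n=3 returns exactly the three closest categories, nearest first, which is what the question asks for.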
Upvotes: 0
Reputation: 1604
Setup:
df = pd.DataFrame(np.random.randint(0,100, size= (10,4)), columns=list('ABCD'))
df['category'] = ['apple','orange','grape','berry','strawberry','banana','kiwi','lemon','lime','pear']
df.set_index('category', inplace = True)
It's a mouthful but:
lowest_3 = [df.index[pd.Series([abs(df.loc[ind1] - df.loc[ind2]).sum() for ind2 in df.index]).argsort()[1:4]].tolist() for ind1 in df.index]
lowest_3_series = pd.Series(lowest_3, index = df.index)
lowest_3_series['apple']  # e.g. ['banana', 'lemon', 'grape']; results will differ because the data is random
This will get you a list of the 3 closest categories for each value in df.index.
For instance, the first element of this list is your solution for 'apple'.
Explanation:
First, you write a list comprehension over each index in df.index. The nested list comprehension iterates over df.index again, so every pair of rows is compared (n^2 comparisons in total). Each comparison takes the absolute differences between the two rows' column values and sums them. You then turn this list of distances into a Series and use argsort to fetch the positions of the 3 smallest, skipping the first one (the reflexive comparison, whose distance is always 0). Finally, you index df.index with these positions, which gets you the names of the 3 closest categories.
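The same all-pairs idea can also be written with NumPy broadcasting instead of nested list comprehensions. The following is a sketch under assumptions, not a drop-in replacement: the function name all_nearest and the example data are hypothetical, and it drops each category's own label explicitly rather than relying on the self-comparison sorting into position 0.

```python
import numpy as np
import pandas as pd

def all_nearest(df, n=3):
    """Map every index label to its n closest labels by Manhattan distance."""
    values = df.to_numpy()
    # pairwise[i, j] = Manhattan distance between row i and row j (broadcasting).
    pairwise = np.abs(values[:, None, :] - values[None, :, :]).sum(axis=2)
    dist = pd.DataFrame(pairwise, index=df.index, columns=df.index)
    # For each category, drop its own zero distance and keep the n closest labels.
    return {
        cat: dist.loc[cat].drop(cat).nsmallest(n).index.tolist()
        for cat in df.index
    }

# Small fixed example (hypothetical data) so the result is reproducible.
example = pd.DataFrame(
    {"A": [0, 1, 5, 10, 3], "B": [0, 1, 5, 10, 2]},
    index=["apple", "orange", "grape", "berry", "kiwi"],
)

neighbours = all_nearest(example, n=2)
print(neighbours["apple"])  # ['orange', 'kiwi']
```

This still performs n^2 comparisons, but they happen inside NumPy, and the intermediate array holds n x n x (number of columns) values, so it is best suited to modestly sized frames.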
Upvotes: 1