Reputation: 1481
I want to optimize a piece of code that calculates the nearest neighbour for every item in a dataset with 100k rows. The dataset contains 50 variable columns that describe each row item, and most cells contain a probability value between 0 and 1.
Question: I am fairly new to Python, but I would like to know whether more advanced users can recommend a better structure for the code below to speed up this calculation. Currently the program takes a very long time to complete. Thanks in advance!
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsRegressor
df_set = pd.read_excel('input.xlsx', skiprows=0)
distance_columns = ["var_1",
......,
......,
......,
"var_50"]
def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_row[k]) ** 2
    return math.sqrt(inner_value)
knn_name_list = []
for i in range(len(df_set.index)):
    numeric = df_set[distance_columns]
    normalized = (numeric - numeric.mean()) / numeric.std()
    normalized.fillna(0, inplace=True)
    selected_normalized = normalized[df_set["Filename"] == df_set["Filename"][i]]
    euclidean_distances = normalized.apply(lambda row: distance.euclidean(row, selected_normalized), axis=1)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    second_smallest = distance_frame.iloc[1]["idx"]
    most_similar_to_selected = df_set.loc[int(second_smallest)]["Filename"]
    knn_name_list.append(most_similar_to_selected)
print(knn_name_list)
df_set['top_neighbor'] = np.array(knn_name_list)
df_set.to_csv('output.csv', encoding='utf-8', sep=',', index=False)
Upvotes: 2
Views: 674
Reputation: 6260
To add another idea to @Amine's approach, you can also include a PCA transformation
(https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
It would work like this:
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
# load the data
df_set = ...
# select the numerical columns
numeric = df_set[distance_columns]
# normalize
normalized = normalize(numeric, norm='l1', axis=1, copy=True, return_norm=False)
# reduce the number of components (here to 25)
pca = PCA(n_components=25)
pca_normalized = pca.fit_transform(normalized)
# initialize NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean', n_jobs=-1)
# fit with the normalized, PCA-reduced data
neigh.fit(pca_normalized)
...
second_smallest = ...
# get the rows most similar to your selected data
most_similar_to_selected = neigh.kneighbors(second_smallest)
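For reference, here is a minimal end-to-end sketch of how this idea could produce the top_neighbor column from the question in a single batch query. It assumes the Filename column and distance_columns from the question, fills missing cells with 0, and uses n_neighbors=2 because the closest neighbour of every row is the row itself:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

df_set = pd.read_excel('input.xlsx')
# fill missing cells so normalize/PCA do not fail on NaN
numeric = df_set[distance_columns].fillna(0)
# L1-normalize each row, then project onto 25 principal components
reduced = PCA(n_components=25).fit_transform(
    normalize(numeric, norm='l1', axis=1))
neigh = NearestNeighbors(n_neighbors=2, metric='euclidean', n_jobs=-1)
neigh.fit(reduced)
# one batch query for all rows; indices[:, 1] is each row's nearest *other* row
distances, indices = neigh.kneighbors(reduced)
df_set['top_neighbor'] = df_set['Filename'].to_numpy()[indices[:, 1]]
df_set.to_csv('output.csv', encoding='utf-8', index=False)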
Upvotes: 0
Reputation: 1261
I would recommend using NearestNeighbors (set n_jobs to -1 to use all processors).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
# load the data
df_set = ...
# select the numerical columns
numeric = df_set[distance_columns]
# normalize
normalized = normalize(numeric, norm='l1', axis=1, copy=True, return_norm=False)
# initialize NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean', n_jobs=-1)
# fit with the normalized data
neigh.fit(normalized)
...
second_smallest = ...
# get the rows most similar to your selected data
most_similar_to_selected = neigh.kneighbors(second_smallest)
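One caveat: normalize(..., norm='l1', axis=1) rescales each row so its values sum to 1, which is not the per-column z-scoring used in the question. If you want to keep the question's normalization, a sketch using StandardScaler (my substitution, not part of this answer) would look like:
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# z-score every column, analogous to (numeric - numeric.mean()) / numeric.std()
scaled = StandardScaler().fit_transform(df_set[distance_columns].fillna(0))
neigh = NearestNeighbors(n_neighbors=2, metric='euclidean', n_jobs=-1)
neigh.fit(scaled)
# batch query; column 1 of indices holds each row's nearest other row
distances, indices = neigh.kneighbors(scaled)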
Upvotes: 1