Reputation: 11
I am working with the affinity propagation algorithm and I want to write it from scratch without using scikit-learn. I have written the responsibility and availability updates with nested for loops or a list comprehension, but each one takes more than 30 minutes to run on data with more than 2000 individuals.
from scipy.spatial.distance import euclidean, pdist, squareform
import numpy as np
import pandas as pd

def similarity_func(u, v):
    # Affinity propagation uses negative Euclidean distance as the similarity
    return -euclidean(u, v)

csv_data = pd.read_csv("DadosC.csv", delimiter=",", encoding="utf8", engine="python")
X = csv_data[["InicialL", "FinalL"]].to_numpy().copy()  # a list, not a set, selects columns
dists = pdist(X, similarity_func)
distM = squareform(dists)  # squareform already returns an ndarray
# set the self-similarities (preferences) to the median similarity
np.fill_diagonal(distM, np.median(distM))
A = np.zeros((X.shape[0], X.shape[0]))

def Respo(A, S, i, j):
    # responsibility: r(i, j) = s(i, j) - max_{k != j} (a(i, k) + s(i, k))
    a_0 = np.delete(A[i], j)
    s_0 = np.delete(S[i], j)
    return S[i][j] - max(a_0 + s_0)

Lis = [[Respo(A, distM, i, j) for i in range(X.shape[0])] for j in range(X.shape[0])]
Res = np.reshape(Lis, (X.shape[0], X.shape[0])).T
This is what I have. A and S are 2000x2000 arrays; A is initialized to zeros and is then updated by a similar function. When X is a 2000x2 array, this takes too long to compute. What alternative can you think of?
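For reference, the per-pair Respo call above recomputes np.delete and a Python-level max for every (i, j), which is what dominates the runtime. The same responsibility matrix can be sketched as a single broadcast computation; this is an illustrative alternative (the function name responsibilities is mine), using the row-wise maximum and second maximum of A + S:

```python
import numpy as np

def responsibilities(A, S):
    """Vectorized r(i, j) = s(i, j) - max_{k != j} (a(i, k) + s(i, k))."""
    AS = A + S
    n = S.shape[0]
    rows = np.arange(n)
    first_arg = AS.argmax(axis=1)           # column of each row's maximum
    first_max = AS[rows, first_arg]
    AS_wo_max = AS.copy()
    AS_wo_max[rows, first_arg] = -np.inf    # mask the maximum out
    second_max = AS_wo_max.max(axis=1)      # max of the remaining entries per row
    # generic case: for j != argmax, max over k != j is the row maximum
    R = S - first_max[:, None]
    # at the argmax column, the max over k != j is the second maximum
    R[rows, first_arg] = S[rows, first_arg] - second_max
    return R
```

This replaces the O(n^2) Python-level loop with a handful of O(n^2) array operations, which for n = 2000 should run in well under a second rather than tens of minutes.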
Upvotes: 1
Views: 226
Reputation: 42
Python is not designed with execution performance in mind, and the vast majority of scikit-learn consists of wrappers around highly mature C modules. While the intent is admirable, I would recommend becoming more familiar with how this library works, in particular how it transforms Python data types into their C equivalents.
To compete with the library's speed you would first need to understand how it calls and organizes its data structures, and then try to improve on them. That is not impossible, but it is highly unlikely, given this quote from the FAQ page:
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
Upvotes: 1
Reputation: 1003
You can try the built-in map function. I could demonstrate further if you had posted 'A' and 'distM', but feel free to look into it: https://docs.python.org/3/library/functions.html#map Generally, whenever map is applicable it is preferred, since looping in Python isn't as fast.
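A sketch of what that could look like with the question's Respo function, on small stand-in arrays since 'A' and 'distM' weren't posted. Note this is still a Python-level loop, so any speedup over the list comprehension is modest; iterating the (i, j) pairs in row-major order also removes the need for the final transpose:

```python
import numpy as np
from itertools import product

def Respo(A, S, i, j):
    # responsibility of point i toward candidate exemplar j (as in the question)
    a_0 = np.delete(A[i], j)
    s_0 = np.delete(S[i], j)
    return S[i][j] - max(a_0 + s_0)

n = 4  # small stand-in for X.shape[0]
rng = np.random.default_rng(1)
S = rng.normal(size=(n, n))  # stand-in for distM
A = np.zeros((n, n))

# map over all (i, j) pairs in row-major order, so Res[i, j] lands in place
Res = np.fromiter(
    map(lambda ij: Respo(A, S, *ij), product(range(n), range(n))),
    dtype=float,
).reshape(n, n)
```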
Upvotes: -1
Reputation: 460
SciKit-Learn is a wrapper around C libraries, which allows for multi-threading and faster execution speed in general. It also has many built-in optimizations...
If you are trying to compete with the speed of SciKit-Learn, you will probably need to code in C instead of Python.
Upvotes: 2