Reputation: 71
How can I perform inverse locally linear embedding (LLE) using sklearn or other Python packages?
I would like to apply classification algorithms (SVM, neural networks, ...) to some tabular data X, with y being the target class variable.
As usual, the procedure is the following:
I split X and y into X_train, y_train, X_test, y_test. Since I have a large number of parameters (columns), I can reduce their number by applying LLE to X_train in order to obtain X_train_lle. y is the target variable and does not undergo any transformation. After that, I can simply train a model on X_train_lle.
The problem arises when I want to use the trained model on X_test. If LLE is performed on X_test together with X_train, it introduces data leakage. If LLE is performed solely on X_test, the new X_test_lle might be completely different, since the algorithm uses k nearest neighbours. I guess that the correct procedure is to perform inverse LLE on X_test with the parameters obtained on X_train, and then use the classification model on X_test_lle.
I've checked some references, and Section 2.4.1 of this paper deals with inverse LLE: https://arxiv.org/pdf/2011.10925.pdf
How does one do inverse LLE with Python (preferably sklearn)?
Here is a code example:
import numpy as np
from sklearn import preprocessing
from sklearn import svm, datasets, model_selection
from sklearn.manifold import LocallyLinearEmbedding
### Generating dummy data
n_row = 10000 # these numbers are much bigger for the real problem
n_col = 50
X = np.random.random((n_row, n_col)) # the shape must be passed as a tuple
y = np.random.randint(5, size=n_row) # five different classes labeled from 0 to 4
### Preprocessing ###
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.5, random_state = 1)
#standardization using StandardScaler applied to X_train and then scaling X_train and X_test
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
### Here is the part with LLE ###
# We reduce the parameter space to 10 dimensions using 15 nearest neighbours
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method='modified', eigen_solver='dense')
X_train_lle = lle.fit_transform(X_train)
### Here is the training part ###
# we want to apply SVM to transformed data X_train_lle
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train_lle, y_train)
# Here should go the code to do inverse LLE on X_test,
# i.e. where do the values of X_test_lle fit in the manifold X_train_lle?
### After the previous part of the code was successfully solved by stackoverflow community :)
#Predict the response for test dataset
y_pred = clf.predict(X_test_lle)
Upvotes: 1
Views: 957
Reputation: 26
It is possible to do the inverse transform using the method in the paper you mentioned (Ghojogh et al. (2020), Sec. 2.4.1) and in other papers too (e.g., Franz et al. (2014), Sec. 4.1). The idea is to find the k nearest neighbors of each point in the embedded space and express the point as a linear combination of those neighbors. You then keep the weights and use them to express each point in the original space as a combination of the same neighbors. The same number of neighbors should be used as in the original forward LLE.
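Written out, the idea is roughly the following. The notation is my own shorthand and may differ from the papers': y_i is a point in the embedded space, x_i is the corresponding point in the original space, and N_k(i) is the set of the k nearest neighbors of y_i in the embedded space.

```latex
\begin{aligned}
w_{i} &= \arg\min_{w}\ \Bigl\| y_i - \sum_{j \in \mathcal{N}_k(i)} w_{ij}\, y_j \Bigr\|_2^2
\quad \text{subject to} \quad \sum_{j \in \mathcal{N}_k(i)} w_{ij} = 1, \\
\hat{x}_i &= \sum_{j \in \mathcal{N}_k(i)} w_{ij}\, x_j .
\end{aligned}
```

Stacking the weights w_ij into a sparse matrix W turns the second line into a single matrix product, with one reconstructed point per row.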
The code would look something like this, using the barycenter_kneighbors_graph function:
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph
# calculate the weights for expressing each point in the embedded space as a linear combination of its neighbors
W = barycenter_kneighbors_graph(Y, n_neighbors = k, reg = 1e-3)
# reconstruct the data points in the high dimensional space from its neighbors using the weights calculated based on the embedded space
X_reconstructed = W @ X
where Y is the result of the original LLE embedding (this is X_train_lle in your code snippet), X is the original data matrix and k is the number of nearest neighbors.
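To try this end to end, here is a minimal sketch on a small synthetic dataset. The swiss roll data, the sample size and k = 12 are arbitrary choices for illustration, not values from the question. Note that barycenter_kneighbors_graph is imported from a private scikit-learn module, so its location may change between versions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
# Private module: not part of the public API, may move between releases
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph

k = 12  # must match the number of neighbors used in the forward LLE

# Toy 3-D data lying on a 2-D manifold (illustrative only)
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Forward LLE: 3-D -> 2-D
lle = LocallyLinearEmbedding(n_neighbors=k, n_components=2, eigen_solver='dense')
Y = lle.fit_transform(X)

# Weights that express each embedded point through its k neighbors in Y
W = barycenter_kneighbors_graph(Y, n_neighbors=k, reg=1e-3)

# Apply the same weights to the original points to go back to 3-D
X_reconstructed = W @ X

# Each row of W should sum to 1 (barycentric weights)
row_sums = np.asarray(W.sum(axis=1)).ravel()
rel_error = np.linalg.norm(X - X_reconstructed) / np.linalg.norm(X)
print('shapes:', X.shape, Y.shape, X_reconstructed.shape)
print('relative reconstruction error:', rel_error)
```

The reconstruction is only approximate, so it is worth checking the relative error on your own data before relying on it. Also keep in mind that this maps embedded points back to the original space; it does not map new points such as X_test into the embedding.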
Upvotes: 1
Reputation: 1482
Just like any scikit-learn transformer, objects created from the LocallyLinearEmbedding class (from sklearn.manifold import LocallyLinearEmbedding) implement .fit() and .transform() methods. (In your example, StandardScaler is another transformer.) Therefore, to find the embedding space using data from X_train, you can do:
embedding = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method='modified', eigen_solver='dense')
embedding.fit(X_train)
# Transform X_train using LLE fit above
X_train_lle = embedding.transform(X_train)
# transform X_test, using LLE fit on X_train:
X_test_lle = embedding.transform(X_test)
print('train: ', X_train.shape, X_train_lle.shape)
print('test: ', X_test.shape, X_test_lle.shape)
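The same fit-on-train, transform-on-test pattern can also be wrapped in a Pipeline, so that the scaler, the LLE and the classifier are all fitted on the training data only. This is a sketch under assumptions: the make_classification data and its sizes are placeholders, while the LLE and SVC settings are copied from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for the real data (sizes are arbitrary)
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Every step is fitted on X_train only, so X_test cannot leak into the embedding
model = make_pipeline(
    StandardScaler(),
    LocallyLinearEmbedding(n_neighbors=15, n_components=10,
                           method='modified', eigen_solver='dense'),
    SVC(kernel='linear'),
)
model.fit(X_train, y_train)

# predict() scales X_test, maps it into the fitted embedding, then classifies
y_pred = model.predict(X_test)

# The embedded test points, taken from the pipeline without the final SVC
X_test_lle = model[:-1].transform(X_test)
print('test: ', X_test.shape, X_test_lle.shape, y_pred.shape)
```

One caveat: as far as I recall, the docstring of LocallyLinearEmbedding.transform discourages combining it with methods that are not scale-invariant (like SVMs) because of the scaling it performs, so it is worth validating the accuracy of this combination on held-out data.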
Upvotes: 0