Djordje Savic

Reputation: 71

Inverse locally linear embedding (LLE) in Python

How can I perform inverse locally linear embedding (LLE) using sklearn or other Python packages?

I would like to run classification machine learning algorithms (SVM, neural networks...) on some tabular data X, with y being the target class variable.

As usual, the procedure is the following:

Split X and y into X_train, y_train, X_test, y_test. Since I have a large number of parameters (columns), I can reduce their number by applying LLE to X_train in order to obtain X_train_lle. y is the target variable and does not undergo any transformation. After that, I can simply train a model on X_train_lle. The problem arises when I want to use the trained model on the test set. If LLE is performed on X_test together with X_train, it introduces data leakage. Also, if LLE is performed solely on X_test, the resulting X_test_lle might be completely different, since the algorithm uses k nearest neighbours. I guess the correct procedure should be to perform inverse LLE on X_test with the parameters obtained on X_train, and then use the classification model on X_test_lle.

I've checked some references; section 2.4.1 of this paper deals with inverse LLE: https://arxiv.org/pdf/2011.10925.pdf

How does one do inverse LLE with Python (preferably sklearn)?

Here is a code example:

import numpy as np

from sklearn import model_selection
from sklearn import preprocessing
from sklearn import svm
from sklearn.manifold import LocallyLinearEmbedding


### Generating dummy data
n_row = 10000  # these numbers are much bigger for the real problem
n_col = 50
X = np.random.random((n_row, n_col))
y = np.random.randint(5, size=n_row) # five different classes labeled from 0 to 4

### Preprocessing ###
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.5, random_state = 1)
# standardization: fit StandardScaler on X_train, then scale both X_train and X_test
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Here is the part with LLE ###
# We reduce the parameter space to 10 with 15 nearest neighbours
embedding = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method='modified', eigen_solver='dense')
X_train_lle = embedding.fit_transform(X_train)

### Here is the training part ###
# we want to apply SVM to transformed data X_train_lle
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train_lle, y_train)


# Here should go the code to do inverse LLE on X_test,
# i.e. where the values of X_test_lle fit in the manifold of X_train_lle


### After the previous part of the code is successfully solved by the stackoverflow community :)
#Predict the response for test dataset
y_pred = clf.predict(X_test_lle)

Upvotes: 1

Views: 957

Answers (2)

tschala

Reputation: 26

It is possible to do the inverse transform using the method in the paper you mentioned (Ghojogh et al. (2020), Sec. 2.4.1) and in other papers too (e.g., Franz et al. (2014), Sec. 4.1). The idea is that you find the k-nearest neighbors of each point in the embedded space and express the point as a linear combination of those neighbors. You then keep the weights and use the same weights to express each point in the original space as a combination of its k-nearest neighbors. Obviously, the same number of neighbors should be used as in the original forward LLE.
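In symbols (my paraphrase of that reconstruction step; y_i are the embedded points, x_j the original training points, and N(i) the k nearest neighbors of y_i in the embedded space):

\min_{w_i} \left\| y_i - \sum_{j \in N(i)} w_{ij}\, y_j \right\|^2 \quad \text{subject to} \quad \sum_{j \in N(i)} w_{ij} = 1

\hat{x}_i = \sum_{j \in N(i)} w_{ij}\, x_j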

The code would look something like this, using the barycenter_kneighbors_graph function:

# note: barycenter_kneighbors_graph lives in a private module, so its location may change between sklearn versions
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph

# calculate the weights for expressing each point in the embedded space as a linear combination of its neighbors
W = barycenter_kneighbors_graph(Y, n_neighbors=k, reg=1e-3)

# reconstruct the points in the high-dimensional space from their neighbors,
# using the weights calculated in the embedded space
X_reconstructed = W @ X

where Y is the result of the original LLE embedding (X_train_lle in your code snippet), X is the original data matrix, and k is the number of nearest neighbors.
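For completeness, here is a minimal end-to-end sketch of this reconstruction (the toy data and variable names are mine, not from the question):

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph

rng = np.random.default_rng(0)
X = rng.random((500, 50))  # stand-in for the scaled X_train
k = 15

# forward LLE: 50 columns down to 10
embedding = LocallyLinearEmbedding(n_neighbors=k, n_components=10,
                                   method='modified', eigen_solver='dense')
Y = embedding.fit_transform(X)  # this is X_train_lle in the question

# barycenter weights computed in the embedded space...
W = barycenter_kneighbors_graph(Y, n_neighbors=k, reg=1e-3)

# ...applied to the original points to invert the embedding
X_reconstructed = W @ X

# sanity check: relative reconstruction error
print(np.linalg.norm(X - X_reconstructed) / np.linalg.norm(X))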

Upvotes: 1

TC Arlen

Reputation: 1482

Just like any scikit-learn transformer, objects created from the LocallyLinearEmbedding class (from sklearn.manifold import LocallyLinearEmbedding) implement .fit() and .transform() methods. (In your example, StandardScaler is another transformer.) Therefore, to find the embedding space using the data from X_train, you can do:

embedding = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method='modified', eigen_solver='dense')
embedding.fit(X_train)

# Transform X_train using LLE fit above
X_train_lle = embedding.transform(X_train)

# transform X_test, using LLE fit on X_train:
X_test_lle = embedding.transform(X_test)

print('train: ', X_train.shape, X_train_lle.shape)
print('test: ', X_test.shape, X_test_lle.shape)
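With these in hand, the rest of your snippet goes through unchanged (sketched here for completeness; clf is the SVC from your question):

# train the classifier on the embedded training data
clf.fit(X_train_lle, y_train)

# predict using the test data mapped into the same embedding
y_pred = clf.predict(X_test_lle)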

Upvotes: 0
