I have two data sets, train and test, with 30213 and 30235 items respectively and 66 dimensions each.
I am trying to apply scikit-learn's t-SNE to reduce the dimension to 2. Since the data sets are large and I get a MemoryError if I try to process all the data in one shot, I break them into chunks and transform one chunk at a time, like this:
import numpy as np
from sklearn import manifold

tsne = manifold.TSNE(n_components=2, perplexity=30, init='pca', random_state=0)

# placeholders for the 2-D embeddings
X_tsne_train = np.array([[0.0 for j in range(2)] for i in range(X_train.shape[0])])
X_tsne_test = np.array([[0.0 for j in range(2)] for i in range(X_test.shape[0])])

d = ((X_train, X_tsne_train), (X_test, X_tsne_test))
chunk = 5000
for Z in d:
    x, x_tsne = Z[0], Z[1]
    pstart, pend = 0, 0
    # embed one chunk of rows at a time
    while pend < x.shape[0]:
        if pend + chunk < x.shape[0]:
            pend = pstart + chunk
        else:
            pend = x.shape[0]
        print 'pstart = ', pstart, 'pend = ', pend
        x_part = x[pstart:pend]
        x_tsne[pstart:pend] += tsne.fit_transform(x_part)
        pstart = pend
It runs without a MemoryError, but I find that different runs of the script produce different outputs for the same data items. This could be because the fit and transform operations happen together on each chunk of data. But if I try to fit on the train data with tsne.fit(X_train), I get a MemoryError. How can I correctly reduce the dimension of all data items in the train and test sets to 2 without any incongruence among the chunks?
Upvotes: 7
Views: 4944
I am not entirely certain what you mean by "different outputs for the same data items", but here are some comments that might help you.
First, t-SNE is not really a "dimension reduction" technique in the same sense that PCA or other methods are. There is no way to take a fixed, learned t-SNE model and apply it to new data. (Note that the class has no transform() method, only fit() and fit_transform().) You will, therefore, be unable to use a "train" and "test" set.
Second, each and every time you call fit_transform() you are getting a completely different model. The meaning of your reduced dimensions is, therefore, not consistent from chunk to chunk. Each chunk has its own little low-dimensional space. The model is different each time, and therefore the data are not being projected into the same space.
Third, you don't include the code where you divide "train" from "test". It may be that, while you are being careful to set the random seed of t-SNE, you are not setting the random seed of your train/test division, resulting in different data divisions, and thus different results on subsequent runs.
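For example, if the split is made with scikit-learn's train_test_split (an assumption on my part, since that code isn't shown), fixing its random_state makes the division reproducible:

from sklearn.model_selection import train_test_split

# Hypothetical split of the full 66-dimensional data set X.
# Without random_state, the train/test division changes on every run.
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)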
Finally, if you want to use t-SNE to visualize your data, you might consider following the advice on the documentation page, and applying PCA to reduce the dimensionality of the input from 66 to, say, 15. That would dramatically reduce the memory footprint of t-SNE.
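A sketch of that approach (the parameter values are illustrative, and whether it fits in memory still depends on your machine and scikit-learn version): reduce with PCA first, then run a single fit_transform() over all the rows you want in the same 2-D space, train and test stacked together, so every point lands in one consistent embedding.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stack train and test so everything is embedded in one shared space.
X_all = np.vstack([X_train, X_test])

# PCA from 66 down to ~15 dimensions shrinks t-SNE's working set.
X_reduced = PCA(n_components=15, random_state=0).fit_transform(X_all)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne_all = tsne.fit_transform(X_reduced)

# Split the embedding back into the original train and test rows.
X_tsne_train = X_tsne_all[:X_train.shape[0]]
X_tsne_test = X_tsne_all[X_train.shape[0]:]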
Upvotes: 2