Reputation: 445
Short Version: I'm having difficulty reducing the number of dimensions of my training data using PCA. The training data is built for a 2D CNN that classifies images of graphs into three classes.
I'm new to Principal Component Analysis (PCA). I have a 2D Convolutional Neural Network that classifies images of graphs (36 by 36 px) into one of three classes, as such:
I realized that most of the pixels are white, so the CNN is very inefficient and takes a long time to train. I became aware of dimensionality reduction techniques and tried to use the PCA. I converted one of my training images to grayscale and visualized the "eigengraph" (shown on left). I then reconstructed the original from the eigengraph (shown on right).
X=grayscale
pca_oliv = PCA(n_components = 36)
X_proj = pca_oliv.fit_transform(X)
print(np.cumsum(pca_oliv.explained_variance_ratio_))
plt.imshow(np.reshape(pca_oliv.components_, (36,36)), cmap=plt.cm.bone, interpolation='nearest')
But I know it can do better. This is with n=36 dimensions. By plotting the explained variance, I find the elbow at 3 dimensions. That means with just 3 dimensions out of 36, I can preserve 91.7% of the variance.
But if I change pca_oliv = PCA(n_components = 36) to pca_oliv = PCA(n_components = 3), all goes haywire: ValueError: cannot reshape array of size 108 into shape (36,36). Why? What am I doing wrong?
pip install tensorflow
pip install numpy
pip install matplotlib
"""# Import Libraries"""
# Import Libraries
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, Dropout
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
"""# Load Dataset"""
import pathlib
dataset_url = "*/TrainingSet.tar.gz"
data_dir = tf.keras.utils.get_file(origin = dataset_url,
fname = "TrainingSet",
untar = True)
data_dir = pathlib.Path(data_dir)
"""# Display # Images to check"""
print(list(data_dir.glob('*/*.png')))
image_count = len(list(data_dir.glob('*/*.png')))
print(image_count)
"""# Display sample image"""
pip install scikit-learn
import numpy as np
import os
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn.decomposition import PCA
graphs = list(data_dir.glob('*/*.png'))
PIL.Image.open(str(graphs[6]))
"""# Define Image Dimensions & Batch Size"""
batch_size = 32
img_height = 36
img_width = 36
"""# Create Training & Validation Sets (80%, 20%)"""
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
"""# Define 3 Classes"""
class_names = ['Cubic Sinusoidal', 'Linear Sinusoidal', 'Quadratic Sinusoidal']
print(class_names)
"""# Supervised Learning (9 Samples from the Training Set)"""
!pip install scikit-image
from skimage import data
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
subGraphs = []
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        subGraphs.append(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")
subGraphs = np.array(subGraphs)
print(subGraphs.shape)
grayscale = rgb2gray(subGraphs[1])
print(grayscale.shape)
X=grayscale
pca_oliv = PCA(n_components = 36)
X_proj = pca_oliv.fit_transform(X)
print(np.cumsum(pca_oliv.explained_variance_ratio_))
plt.plot(np.cumsum(pca_oliv.explained_variance_ratio_))
plt.imshow(np.reshape(pca_oliv.components_, (36,36)), cmap=plt.cm.bone, interpolation='nearest')
X_inv_proj = pca_oliv.inverse_transform(X_proj)
X_proj_img = np.reshape(X_inv_proj,(1,36,36))
plt.imshow(X_proj_img[0], cmap=plt.cm.bone, interpolation='nearest')
For reference, here is my Jupyter Notebook: PCA+CNN. If anyone can help, that would be great.
Upvotes: 1
Views: 1998
Reputation: 2720
PCA reduces the number of dimensions while ensuring that the lower-dimensional representation retains as much of the variance as possible. Now think about how many dimensions you have originally. As far as I can see, you used (36, 36) grayscale images, so each pixel is one of your original features. You have also taken 9 images to apply PCA to.
In this case the number of examples is less than the number of original features, i.e. 9 < 36*36, so you need no more than 9 principal components to cover the full variance (strictly, centering the data leaves at most 8 components with nonzero variance, although scikit-learn accepts n_components up to 9). If your number of examples were greater than the number of features (36*36 = 1296), you would be able to set n_components to larger values. See sklearn.decomposition.PCA and "Why are there only n−1 principal components for n data if the number of dimensions is ≥n?"
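To see the limit concretely, here is a minimal sketch on synthetic data. The random array stands in for the 9 flattened 36×36 images, and the names rng, X, pca and too_many_ok are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for 9 flattened 36x36 grayscale images
# (assumption: random values, not the real graph images).
rng = np.random.default_rng(0)
X = rng.random((9, 36 * 36))

# With 9 samples, at most min(9, 1296) = 9 components can be requested.
pca = PCA(n_components=9).fit(X)
print(pca.components_.shape)              # (9, 1296)
# The last ratio is numerically zero: centering leaves rank <= 8.
print(pca.explained_variance_ratio_[-1])

# Asking for more components than samples is rejected by scikit-learn.
try:
    PCA(n_components=10).fit(X)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("ValueError:", err)
```

On real images the variance would be concentrated in the first few components rather than spread evenly as it is for random data, but the shape limits are the same.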
Anyway, I won't go deeper into the details of PCA; instead I'll describe what you need to change in your code.
grayscale = rgb2gray(subGraphs)
print(grayscale.shape)
grayscale = grayscale.reshape((grayscale.shape[0], grayscale.shape[1] * grayscale.shape[2]))
print(grayscale.shape)
PCA expects the input shape to be (number of examples, number of features), so you have to keep the number of examples in the first dimension, and the second dimension holds all the pixel values (the original features). If you had used color images, then you would need to include all channels' features in this same second dimension, somewhat like:
color_img = color_img.reshape((color_img.shape[0], color_img.shape[1] * color_img.shape[2] * color_img.shape[3]))
print(color_img.shape)
Now you can apply PCA:
X=grayscale
pca_oliv = PCA(n_components = 9)
X_proj = pca_oliv.fit_transform(X)
print(np.cumsum(pca_oliv.explained_variance_ratio_))
plt.plot(np.cumsum(pca_oliv.explained_variance_ratio_))
Please note that you won't be able to set n_components to more than 9, as you have used only 9 images. If you check the shape of X_proj, you will find it is (9, 9). The first 9 is the number of examples, and the second 9 is the number of dimensions (n_components) of the lower-dimensional space in which each example is represented.
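These shapes also account for the ValueError in the question: components_ always has shape (n_components, n_features), so with a single 36×36 image passed as X and n_components = 3 it holds 3 × 36 = 108 values, which cannot fill a 36×36 array. A small sketch, assuming random arrays as stand-ins for the images (all variable names here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Case from the question: ONE 36x36 image passed as X, so PCA treats it
# as 36 "samples" (rows) with 36 "features" (columns).
single_image = rng.random((36, 36))
pca_rows = PCA(n_components=3).fit(single_image)
print(pca_rows.components_.shape)   # (3, 36) -> 108 values, not 36 * 36

# Case from the answer: 9 images, each flattened to 1296 pixel features.
images = rng.random((9, 36, 36))
X = images.reshape(len(images), -1)
pca = PCA(n_components=3)
X_proj = pca.fit_transform(X)
print(X_proj.shape)                 # (9, 3): one 3-number code per image
print(pca.components_.shape)        # (3, 1296)

# Each row of components_ is one direction in pixel space, so reshape a
# single row to 36x36 rather than the whole matrix at once.
first_component_img = pca.components_[0].reshape(36, 36)
```

With the flattened layout, each component can be shown with plt.imshow one row at a time, whatever value n_components takes.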
Finally, do an inverse transform to get the original dimensions back (this is just for illustration; you will train your model with X_proj, as it is the lower-dimensional representation):
X_inv_proj = pca_oliv.inverse_transform(X_proj)
print(X_inv_proj.shape)
for index in range(len(X_inv_proj)): # 9
    X_proj_img = np.reshape(X_inv_proj[index],(36,36))
    plt.imshow(X_proj_img, cmap=plt.cm.bone, interpolation='nearest')
    plt.show()
Again, X_proj contains the lower-dimensional representation (9 dimensions) of your 9 examples. It is not an image, so you need not reshape it. You can use it directly for training your model, treating these 9 features as representative of your original 36*36 features.
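As a rough illustration of training on the projected features, the sketch below chains PCA with a classifier. It makes two loud assumptions: random data and labels replace the real training set, and a scikit-learn logistic regression is swapped in for the CNN, since a PCA feature vector is no longer an image that a 2D convolution could consume. The names X_train, y_train, X_val and model are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical data: 90 training and 10 validation images, already
# flattened to 36 * 36 = 1296 features, with labels for 3 classes.
X_train = rng.random((90, 36 * 36))
y_train = rng.integers(0, 3, size=90)
X_val = rng.random((10, 36 * 36))

# The pipeline fits PCA on the training data only, then reuses the same
# projection when predicting on validation data.
model = make_pipeline(PCA(n_components=9), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
print(val_pred.shape)
```

Keeping PCA inside the pipeline means the validation images never influence the projection, which is usually what you want when comparing models.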
Note that the inverse transformation is not always lossless. In your case we took 9 principal components (the maximum possible here), so essentially 100% of the variance is retained, and the inverse transformation restores the original data. If we set n_components to a lower value, the inverse transformation would not restore the original information completely: the shape of X_inv_proj would not change, but it would no longer hold the complete information of the original data.
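The effect can be checked numerically. This is a minimal sketch on random stand-in data (an assumption; real images would typically lose far less at 3 components), comparing the reconstruction error for 3 and 9 components. The names errors and X_rec are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for 9 flattened 36x36 images.
X = rng.random((9, 36 * 36))

errors = {}
for k in (3, 9):
    pca = PCA(n_components=k)
    # Project down to k dimensions, then map back to pixel space.
    X_rec = pca.inverse_transform(pca.fit_transform(X))
    # The reconstructed array always has the original shape.
    assert X_rec.shape == X.shape
    # Mean squared reconstruction error over all pixels.
    errors[k] = float(np.mean((X - X_rec) ** 2))
    print(k, X_rec.shape, errors[k])
```

The shape stays (9, 1296) in both cases; only the error differs, dropping to floating-point noise once all available components are kept.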
Upvotes: 1