Reputation: 105
My dataset consists of visualized binaries that belong either to malware family 1 or to malware family 2. The grayscale images have very distinctive features. Some examples (upper: family 1, lower: family 2):
(example images omitted)
There are 2474 samples of malware family 1 and 2930 samples of malware family 2. As the examples show, samples of the same family are very similar to each other, so a CNN should not have much trouble classifying them.
Nonetheless, the CNN that I used achieves only around 50% accuracy (with a loss of 0.25). In addition, I implemented the InceptionV3 model, but it also achieves only 50% accuracy (with a loss of 0.50). What could be the error here?
Load images:
import os
import cv2
import numpy as np

# Load all images of one family (label 0); the second family
# is loaded the same way with label 1
idx = 0
for elem in os.listdir(directory):
    full_path = os.path.join(directory, elem)
    img = cv2.imread(full_path, cv2.IMREAD_UNCHANGED)
    if idx in train_index:
        dataset4_x_train.append(img)
        dataset4_y_train.append(0)
    else:
        dataset4_x_test.append(img)
        dataset4_y_test.append(0)
    idx += 1

dataset4_x_train = np.array(dataset4_x_train)
dataset4_x_test = np.array(dataset4_x_test)
dataset4_x_train = dataset4_x_train.reshape(-1, 192, 192, 1)
dataset4_x_test = dataset4_x_test.reshape(-1, 192, 192, 1)
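The second family is loaded the same way (before the np.array conversion above), just with label 1. A minimal sketch, assuming hypothetical directory2 and train_index2 names for the second family's files:
# Sketch with assumed names (directory2, train_index2) for family 2;
# this loop runs before the np.array conversion above
idx = 0
for elem in os.listdir(directory2):
    img = cv2.imread(os.path.join(directory2, elem), cv2.IMREAD_UNCHANGED)
    if idx in train_index2:
        dataset4_x_train.append(img)
        dataset4_y_train.append(1)  # family 2 is labeled 1
    else:
        dataset4_x_test.append(img)
        dataset4_y_test.append(1)
    idx += 1

# labels as arrays (needed e.g. for validation_split later on)
dataset4_y_train = np.array(dataset4_y_train)
dataset4_y_test = np.array(dataset4_y_test)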
Custom CNN:
model = Sequential()
model.add(tf.keras.layers.Conv2D(8, 5, activation="relu", input_shape=(192,192,1)))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(16, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(80, 4, activation="relu"))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(2, activation='softmax'))
opt = tf.keras.optimizers.Adam(lr=0.01)
model.compile(opt, loss="mse",metrics=['accuracy'])
model.fit(dataset4_x_train, dataset4_y_train, epochs=100, batch_size=50)
model.evaluate(dataset4_x_test, dataset4_y_test)
InceptionV3:
incept_v3 = tf.keras.applications.inception_v3.InceptionV3(input_shape=(192,192,1), include_top=False, weights=None)
incept_v3.summary()
last_output = incept_v3.get_layer("mixed10").output
x = tf.keras.layers.Flatten()(last_output)
x = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(incept_v3.input, x)
opt = tf.keras.optimizers.Adam(lr=0.001)
model.compile(opt, loss="mse",metrics=['accuracy'])
model.fit(dataset4_x_train, dataset4_y_train, epochs=100, batch_size=50)
model.evaluate(dataset4_x_test, dataset4_y_test)
Upvotes: 0
Views: 132
Reputation: 2066
Your model is underfitting the dataset, which is why you get a low accuracy. Fortunately, increasing the model size fixes that problem. However, a larger model is also more vulnerable to overfitting; to counter that, I would suggest using dropout layers as shown below.
This is also a binary classification problem, for which a binary_crossentropy loss function will work better, together with a lower learning rate to converge to a better accuracy.
model = Sequential()
model.add(tf.keras.layers.Conv2D(16, 3, activation="relu",padding='same', input_shape=(192,192,1)))
model.add(tf.keras.layers.Conv2D(16, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(32, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.Conv2D(32, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(64, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.Conv2D(64, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(92, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.Conv2D(92, 3, activation="relu", padding='same'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # single sigmoid unit to match binary_crossentropy with 0/1 labels
opt = tf.keras.optimizers.Adam(lr=0.0008)
model.compile(opt, loss="binary_crossentropy", metrics=['accuracy'])
model.fit(dataset4_x_train, dataset4_y_train, epochs=100, batch_size=50)
model.evaluate(dataset4_x_test, dataset4_y_test)
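Since the dropout layers are there to control overfitting, it is worth watching validation metrics during training. A minimal sketch, assuming you hold out a fraction of the training data (the 10% split is an assumption, not part of the original setup):
# Hold out 10% of the training data to watch for overfitting:
# validation loss rising while training loss falls is the warning sign
history = model.fit(
    dataset4_x_train, dataset4_y_train,
    epochs=100, batch_size=50,
    validation_split=0.1,  # assumed fraction
)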
Upvotes: 1
Reputation: 839
MSE is normally used for regression problems, and your task is classification, so you should use a different loss function; for example, you can use tf.keras.losses.BinaryCrossentropy. This is most likely the main cause of the low accuracy.
In addition, CNNs normally have more than one hidden dense layer before the output, as in the example below. This normally has a relatively small performance impact compared to the loss change above.
model = Sequential()
model.add(tf.keras.layers.Conv2D(8, 5, activation="relu", input_shape=(192,192,1)))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(8, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(16, 3, activation="relu"))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.Conv2D(80, 4, activation="relu"))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))   # extra hidden dense layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # sigmoid output to match BinaryCrossentropy with 0/1 labels
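For reference, a minimal sketch of compiling this model with that loss (the optimizer and learning rate here are assumptions, not part of the original answer):
# Compile with binary cross-entropy, matching the Dense(1, sigmoid) head
opt = tf.keras.optimizers.Adam(learning_rate=0.001)  # assumed learning rate
model.compile(opt, loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])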
Upvotes: 1