I wrote code that loads a CIC-IDS dataframe of network traffic data containing 79 features and one label column (BENIGN is normal, everything else is an anomaly/attack). From this I need to train an Isolation Forest model so that it can detect the outliers. The code runs, but it performs poorly: accuracy is mediocre, and recall and the F1-score are terrible. If anyone knows this area well, I would be very grateful. The code is long, but maybe someone will take a look. I can't figure out exactly where I went wrong.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import warnings
from sklearn.preprocessing import StandardScaler
# Path to the CIC-IDS CSV file
file_path = 'C:/Users/berta/Desktop/УЧЕБА/ВКР/Файлы/MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv'
# Load the data
df = pd.read_csv(file_path)
# Inspect the first rows
print(df.head())
df.columns = df.columns.str.strip().str.replace(' ', '_')
# Separate features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
y = y.astype(str)
y = y.apply(lambda x: 1 if x == 'BENIGN' else 0)
selected_columns = ['min_seg_size_forward', 'Init_Win_bytes_forward','Init_Win_bytes_backward', 'Flow_IAT_Min',
'Flow_IAT_Mean', 'Flow_IAT_Std', 'Flow_Duration', 'Fwd_Header_Length', 'Fwd_IAT_Mean',
'Fwd_IAT_Max', 'Fwd_Packets/s', 'Bwd_Header_Length', 'Bwd_IAT_Min', 'Bwd_Packets/s']
X_columns = df[selected_columns]
scaler = StandardScaler()
X_selected = scaler.fit_transform(X_columns)
# Label distribution
label_counts = y.value_counts()
print(label_counts)
# Plot the true label counts
plt.figure(figsize=(10, 5))
sns.countplot(x=y)
plt.title('Label counts in the data (1 = BENIGN, 0 = attack)')
plt.xticks(rotation=90)
plt.show()
# Train the model
model = IsolationForest(contamination='auto', random_state=42)
model.fit(X_selected)
y_pred = model.predict(X_selected)
# Plot the predicted label counts
plt.figure(figsize=(10, 5))
sns.countplot(x=y_pred)
plt.title('Predicted label counts (-1 = outliers, 1 = normal)')
plt.xticks(rotation=90)
plt.show()
# Convert predictions: Isolation Forest returns 1 for inliers and -1 for outliers; map -1 to 0
y_pred = [1 if pred == 1 else 0 for pred in y_pred]
# Evaluate against the true labels
print("Accuracy:", accuracy_score(y, y_pred))
print(classification_report(y, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y, y_pred))
# Split into train and test sets (reusing the scaled features X_selected and labels y from above)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)
# Train the model on the training set
model = IsolationForest(contamination='auto', random_state=42)
model.fit(X_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Convert predictions: 1 (inlier) stays 1, -1 (outlier) becomes 0
y_pred = [1 if pred == 1 else 0 for pred in y_pred]
# Evaluate on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
I tried normalizing the data and clearing out NaN and infinity values - nothing helped. Accuracy reached 90% in some runs and only around 30% in others, but recall and the F1-score are low everywhere.
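For context, the NaN/infinity cleanup I tried was roughly the following (a minimal sketch, reusing df, selected_columns and StandardScaler from the snippet above; it runs right after the column names are stripped, before scaling):

import numpy as np
from sklearn.preprocessing import StandardScaler
# Replace infinities with NaN and drop incomplete rows (some rate columns in CIC-IDS contain Infinity),
# then rebuild labels and features from the cleaned frame so X and y stay aligned
df = df.replace([np.inf, -np.inf], np.nan).dropna().reset_index(drop=True)
y = (df.iloc[:, -1].astype(str) == 'BENIGN').astype(int)
X_columns = df[selected_columns]
X_selected = StandardScaler().fit_transform(X_columns)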