I'm performing logistic regression on the 20 Newsgroups dataset, which is a multiclass text classification dataset. I've successfully built the model and now want to visualize the results for my documentation, similar to the example shown in this image:
[image not available]
I'm having trouble with the visualization because my x_test data has been vectorized (it is a sparse TF-IDF matrix) and my y_test consists of class labels stored as strings.
How can I effectively visualize these results?
Here's the code I've been working with:
# Load the dataset
import pandas as pd
import numpy as np
import sklearn
import datasets
import matplotlib.pyplot as plt
import matplotlib
# downloads dataset
dataset = datasets.load_dataset("rungalileo/20_Newsgroups_Fixed")
# Filtered dataset drops any "none" values
filtered_train = dataset["train"].filter(lambda x: x['label'] is not None and x['text'] is not None)
filtered_test = dataset["test"].filter(lambda x: x['label'] is not None and x['text'] is not None)
# ID is not relevant for outcome
filtered_test = filtered_test.remove_columns('id')
filtered_train = filtered_train.remove_columns('id')
# scikit-learn TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# The vectorizer returns a sparse matrix with the TF-IDF weight of each word in the dataset (input must be an iterable of strings)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(filtered_train['text'])
print(vectorizer.get_feature_names_out())
print(X.shape)  # X.toarray() would densify the full vocabulary and may exhaust memory
# We need to figure out a way to have a consistent vocabulary across training and test sets, but get the vector for each.
tokenized_test = []
tokenized_train = []
# If words appear in less than 5 messages or more than 40% of the messages, then they aren't included
# 10 for min_df cuts down about 8k words, but increases metrics by about a percentage on average compared to 5
freqV = TfidfVectorizer(min_df = 5, max_df = 0.40)
for i in range(filtered_train.num_rows):
    tokenized_train.append(filtered_train[i]['text'])
train_vectors = freqV.fit_transform(tokenized_train)
print(freqV.get_feature_names_out())
print(train_vectors.toarray())
for i in range(filtered_test.num_rows):
    tokenized_test.append(filtered_test[i]['text'])
test_vectors = freqV.transform(tokenized_test)
# After this code block, we have sparse vectors holding the TF-IDF weight of each word in a text entry,
# computed against the vocabulary learned from the training set.
# SKLearn's Logistic regression model
from sklearn.linear_model import LogisticRegression
# Increase max iterations to fit datasets
max_iterations = filtered_train.num_rows
# Train > Test
# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16, max_iter=max_iterations)
# We need to determine how this is calculated
x_train = train_vectors
y_train = filtered_train['label']
x_test = test_vectors
y_test = filtered_test['label']
# fit the model with data
logreg.fit(x_train, y_train)
log_pred = logreg.predict(x_test)
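For reference, one visualization that works directly with vectorized features and string labels is a confusion matrix, since it only needs y_test and the predictions. The following is a minimal sketch, not a definitive answer: the tiny corpus, its three labels, and the output file name confusion_matrix.png are hypothetical stand-ins so the snippet runs on its own. With the real data, the last block would take y_test and log_pred from the code above.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical toy corpus standing in for the 20 Newsgroups texts and string labels
train_texts = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into space",
    "the goalie stopped the puck in overtime",
    "the hockey team won the playoff game",
    "my graphics card renders the image slowly",
    "the monitor and video driver need an update",
]
train_labels = ["sci.space", "sci.space", "rec.sport.hockey",
                "rec.sport.hockey", "comp.graphics", "comp.graphics"]
test_texts = [
    "a satellite in orbit",
    "the team scored in the hockey game",
    "update the graphics driver",
]
test_labels = ["sci.space", "rec.sport.hockey", "comp.graphics"]

# Same pattern as above: fit the vocabulary on train, reuse it for test
vec = TfidfVectorizer()
x_train = vec.fit_transform(train_texts)
x_test = vec.transform(test_texts)

logreg = LogisticRegression(random_state=16, max_iter=1000)
logreg.fit(x_train, train_labels)
log_pred = logreg.predict(x_test)

# Rows are true classes, columns are predicted classes, in the order of `labels`
labels = list(logreg.classes_)
cm = confusion_matrix(test_labels, log_pred, labels=labels)

fig, ax = plt.subplots(figsize=(6, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(ax=ax, xticks_rotation="vertical", cmap="Blues", colorbar=False)
ax.set_title("Logistic regression confusion matrix")
fig.tight_layout()
fig.savefig("confusion_matrix.png")  # hypothetical output file name
```

With 20 classes the figure probably needs to be larger (for example figsize=(12, 12)) for the tick labels to stay readable. Passing normalize="true" to confusion_matrix shows per-class recall instead of raw counts.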