Reputation: 1721
Let's say we have this very simple (and dumb) LogisticRegression model:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.feature_extraction.text import Splitter
import pandas as pd
import numpy as np
data_col1 = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes')
)
data_col2 = fetch_20newsgroups(
    subset='train',
    categories=['sci.space'],
    remove=('headers', 'footers', 'quotes')
)
data = pd.DataFrame({
    "col1": data_col1.data[:100],
    "col2": data_col2.data[:100]
})
labels = np.random.randint(2, size=100)
train_data, test_data, train_labels, test_labels = train_test_split(
    data,
    labels,
    test_size=0.1,
    random_state=0,
    shuffle=False
)
def title_features_pipeline():
    return Pipeline([
        ('features', TfidfVectorizer(
            analyzer='word',
            stop_words='english',
            use_idf=True,
            # max_df=0.1,
            min_df=0.01,
            norm=None,
            tokenizer=Splitter()
        )),
    ], verbose=True)
pipeline = Pipeline([
    ('features', ColumnTransformer(
        transformers=[
            ('col1-features', title_features_pipeline(), "col1"),
            ('col2-features', "drop", "col2")
        ],
        remainder="drop",
    )),
    ('regression', LogisticRegression(
        multi_class='ovr',
        max_iter=1000
    ))
], verbose=True)
pipeline.fit(train_data, train_labels)
pred = pipeline.predict(test_data)
print('ROC AUC = {:.3f}'.format(roc_auc_score(test_labels, pred)))
I have spent an astronomical amount of time going through Stack Overflow examples and GitHub code snippets, but I have not been able to get anything to work for my particular case. It is driving me crazy; I am sure I am just doing something wrong!
My goal is to plot the decision boundary for this LogisticRegression classifier, showing which class each document belongs to and the boundary that separates the two classes on the graph.
Along the way, I would like to understand what exactly LogisticRegression does with the vectors coming out of TfidfVectorizer. All of the examples I have looked through so far plot the decision boundary on the assumption that only simple scalars go into the classifier, but in this case we have long TF-IDF vectors. I don't understand how a vector gets transformed into a single value that can be shown on the graph (is it the sum of all scores in the vector, or something else?).
Upvotes: 0
Views: 204
Reputation: 2364
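Logistic regression learns one scalar weight for each term in the TfidfVectorizer vocabulary, plus an intercept. A document's vector is converted to a single score by multiplying each term's TF-IDF value by that term's weight, summing the products, and adding the intercept. The sigmoid of that score is the predicted probability of the positive class, and the decision boundary is the set of points where the score is zero.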
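To make the weighted sum concrete, here is a minimal sketch that recomputes the classifier's score by hand. The four-document corpus and its labels are made up for illustration (they stand in for the 20 newsgroups data so the example runs without a download), and the default whitespace-insensitive tokenizer is used instead of `Splitter`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "faith and belief in god",
    "god and religion debate",
]
labels = np.array([1, 1, 0, 0])

vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

weights = clf.coef_.ravel()    # one learned weight per vocabulary term
intercept = clf.intercept_[0]  # single bias term

# Score = sum over terms of (tf-idf value * weight), plus the intercept.
manual_scores = np.asarray(X @ weights).ravel() + intercept
# The sigmoid turns the score into the probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-manual_scores))

# Both comparisons check the hand-computed values against scikit-learn's own.
print(np.allclose(manual_scores, clf.decision_function(X)))
print(np.allclose(manual_proba, clf.predict_proba(X)[:, 1]))
```

A document is assigned to class 1 when its score is positive and to class 0 when it is negative, so the "single value" asked about in the question is this score, not a plain sum of the TF-IDF entries.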
Plotting decision boundaries is commonly done in two or three dimensions. When you have a text classifier with potentially hundreds of dimensions, it is not as clear what plotting the decision boundary would mean.
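One possible workaround, sketched below under stated assumptions, is to project the TF-IDF vectors down to two dimensions with `TruncatedSVD` and fit a second logistic regression on the projected points. Note that the boundary drawn this way belongs to the 2-D model, not to the original high-dimensional classifier, so it is only an approximate picture. The toy corpus, the helper names `project_and_fit` and `plot_boundary`, and the output file name are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the rocket reached orbit today",
    "the satellite left orbit",
    "nasa launched a new rocket",
    "the shuttle docked with the station",
    "faith and belief in god",
    "god and religion debate",
    "an argument about belief and faith",
    "religion and morality discussion",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])


def project_and_fit(docs, labels):
    """Reduce TF-IDF vectors to 2-D and fit a classifier on the 2-D points."""
    tfidf = TfidfVectorizer(norm=None).fit_transform(docs)
    points = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
    clf = LogisticRegression(max_iter=1000).fit(points, labels)
    return points, clf


def plot_boundary(points, labels, clf, path="boundary.png"):
    """Draw the 2-D model's probability surface and its 0.5 contour."""
    # Imported here so the projection step can be used without matplotlib.
    import matplotlib.pyplot as plt

    x_min, x_max = points[:, 0].min() - 1, points[:, 0].max() + 1
    y_min, y_max = points[:, 1].min() - 1, points[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 200),
        np.linspace(y_min, y_max, 200),
    )
    grid = np.c_[xx.ravel(), yy.ravel()]
    proba = clf.predict_proba(grid)[:, 1].reshape(xx.shape)

    fig, ax = plt.subplots()
    ax.contourf(xx, yy, proba, levels=20, cmap="RdBu", alpha=0.6)
    ax.contour(xx, yy, proba, levels=[0.5], colors="black")  # decision boundary
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="RdBu",
               edgecolors="black")
    ax.set_xlabel("SVD component 1")
    ax.set_ylabel("SVD component 2")
    fig.savefig(path)
    plt.close(fig)


points, clf2d = project_and_fit(docs, labels)
```

Calling `plot_boundary(points, labels, clf2d)` then writes the figure to `boundary.png`. With the random labels used in the question, the two classes will overlap heavily in such a plot, since there is no real signal for the model to separate.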
Upvotes: 1