sklearn plot decision boundary for tfidf binary LogisticRegression classifier

Question

Let's say we have this very simple (and dumb) LogisticRegression model:

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.feature_extraction.text import Splitter
import pandas as pd
import numpy as np

data_col1 = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes')
)
data_col2 = fetch_20newsgroups(
    subset='train',
    categories=['sci.space'],
    remove=('headers', 'footers', 'quotes')
)

data = pd.DataFrame({
    "col1": data_col1.data[:100],
    "col2": data_col2.data[:100]
})
# Random 0/1 targets: placeholders only, unrelated to the text content
labels = np.random.randint(2, size=100)
train_data, test_data, train_labels, test_labels = train_test_split(
    data,
    labels,
    test_size=0.1,
    random_state=0,
    shuffle=False
)


def title_features_pipeline():
    return Pipeline([
        ('features', TfidfVectorizer(
            analyzer='word',
            stop_words='english',
            use_idf=True,
            # max_df=0.1,
            min_df=0.01,
            norm=None,
            tokenizer=Splitter()
        )),
    ], verbose=True)


pipeline = Pipeline([
    ('features', ColumnTransformer(
        transformers=[
            ('col1-features', title_features_pipeline(), "col1"),
            ('col2-features', "drop", "col2")
        ],
        remainder="drop",
    )),
    # multi_class is omitted: the problem is binary, and the parameter is
    # deprecated in recent scikit-learn releases
    ('regression', LogisticRegression(
        max_iter=1000
    ))
], verbose=True)

pipeline.fit(train_data, train_labels)
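pred = pipeline.predict(test_data)
# ROC AUC is a ranking metric, so score it on class-1 probabilities rather than hard labels
proba = pipeline.predict_proba(test_data)[:, 1]
print('ROC AUC = {:.3f}'.format(roc_auc_score(test_labels, proba)))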

I have spent an astronomical amount of time going through Stack Overflow examples and GitHub code snippets, but I have not been able to get anything to work for my particular case, and it is driving me crazy. I am sure I am just doing something wrong!

My goal is to plot the decision boundary for this LogisticRegression classifier, showing which class each document belongs to and the boundary separating the two classes on the graph.
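One possible way to picture this, offered only as a sketch: because the classifier works in a space with one dimension per vocabulary term, the boundary cannot be drawn directly, but each document's signed score from decision_function can be placed on a single axis, with the boundary at score 0. The toy corpus, the labels, and the commented-out output filename below are made up for illustration and are not the 20 newsgroups data from the question; the pipeline is also simplified to a plain TfidfVectorizer without the ColumnTransformer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "rocket launch orbit satellite",
    "orbit moon rocket mission",
    "satellite mission launch moon",
    "belief religion argument faith",
    "faith argument belief debate",
    "religion debate faith belief",
]
labels = np.array([1, 1, 1, 0, 0, 0])

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(docs, labels)

# One signed score per document; the sign decides the predicted class.
scores = model.decision_function(docs)

fig, ax = plt.subplots(figsize=(6, 2.5))
for cls, marker in [(0, "o"), (1, "x")]:
    mask = labels == cls
    # All points sit on y = 0; only the horizontal position carries meaning.
    ax.scatter(scores[mask], np.zeros(mask.sum()), marker=marker,
               label=f"true class {cls}")
ax.axvline(0.0, linestyle="--", color="gray", label="decision boundary (score = 0)")
ax.set_xlabel("decision_function score")
ax.set_yticks([])
ax.legend()
fig.tight_layout()
# fig.savefig("decision_scores.png")  # or plt.show() in an interactive session
```

Documents to the right of the dashed line are predicted as class 1 and those to the left as class 0. A 2-D picture would need an extra projection step (for example a dimensionality reduction), and the boundary drawn there would only approximate the one the model actually uses.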

Along the way, I would like to understand what exactly LogisticRegression does with the vectors coming out of TfidfVectorizer. All of the examples I have looked through so far plot the decision boundary on the assumption that only simple scalars go into the classifier, but in this case we have long tf-idf vectors. I don't understand how a vector gets transformed into a single value shown on the graph (is it the sum of all scores in the vector, or something else?).
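As a rough illustration of the "single value" question: a fitted binary LogisticRegression holds one weight per tf-idf column plus one intercept, and it reduces each document vector to a weighted sum (a dot product), not a plain sum. The sigmoid of that sum is the class-1 probability. The sketch below checks this against scikit-learn's own decision_function and predict_proba on a made-up toy corpus (the documents and labels are assumptions for illustration, not the data from the question).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "rocket launch orbit satellite",
    "orbit moon rocket mission",
    "satellite mission launch moon",
    "belief religion argument faith",
    "faith argument belief debate",
    "religion debate faith belief",
]
labels = np.array([1, 1, 1, 0, 0, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse, shape (n_docs, n_terms)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# One learned weight per vocabulary term, plus a single intercept.
w = clf.coef_.ravel()                        # shape (n_terms,)
b = clf.intercept_[0]

# Each tf-idf vector collapses to one scalar: the weighted sum w . x + b.
scores = np.asarray(X @ w).ravel() + b       # shape (n_docs,)

# The sigmoid maps that scalar to P(class 1); score 0 corresponds to 0.5.
probs = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, clf.decision_function(X)))
print(np.allclose(probs, clf.predict_proba(X)[:, 1]))
```

The decision boundary is therefore the set of vectors where the weighted sum equals 0, which is a hyperplane in the tf-idf space rather than a curve in two dimensions.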
