Reputation: 5117
This is my code:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix, hstack
import pandas as pd
import os
sgd_classifier = SGDClassifier(loss='log', penalty='elasticnet', max_iter=30, n_jobs=60, alpha=1e-6, l1_ratio=0.7, class_weight='balanced', random_state=0)
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4,4), min_df=10)
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
print('TF-IDF number of features:', len(vectorizer.get_feature_names()))
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Inputs shape:', X_train.shape)
sgd_classifier.fit(X_train, y_train)
y_predicted = sgd_classifier.predict(X_test)
y_predicted_prob = sgd_classifier.predict_proba(X_test)
results_report = classification_report(y_test, y_predicted, labels=classes_trained, digits=2, output_dict=True)
df_results_report = pd.DataFrame.from_dict(results_report)
pd.set_option('display.max_rows', 300)
print(df_results_report.transpose())
X_text_train & X_text_test have shapes (2M, 2) and (100k, 2) respectively.
The first column is the description of a financial transaction; generally speaking, each description consists of about 5-15 words. The second column is a categorical variable that just holds the name of the bank related to the transaction.
I merge these two columns into one description, so X_text_train & X_text_test now have shapes (2M,) and (100k,) respectively.
Then I apply TF-IDF, after which X_text_train & X_text_test have shapes (2M, 50k) and (100k, 50k) respectively.
What I observe is that when there is an unseen value in the second column (i.e. a new bank name in the merged description), the SGDClassifier returns very different and quite random predictions compared to what it would return if I had dropped the second column with the bank names entirely.
The same occurs if I apply TF-IDF only to the descriptions and keep the bank names separate as a categorical variable.
Why does this happen with SGDClassifier?
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
The interesting thing is that with TF-IDF the vocabulary is predetermined, so unseen values in the test set are basically not taken into account in the features at all (i.e. all the respective features are just 0), but the SGD still breaks.
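To illustrate that point, here is a minimal sketch with synthetic data (the texts and the bank_zzz token are invented for the example): character n-grams that were never seen during fit simply map to no feature, so only the n-grams an unseen bank name shares with the fitted vocabulary get nonzero values.
from sklearn.feature_extraction.text import TfidfVectorizer
demo_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4))
demo_vectorizer.fit(["payment to bank_a", "transfer to bank_b"])
# "bank_zzz" was never seen during fit: its novel 4-grams are dropped,
# while the 4-grams it shares with the training vocabulary stay nonzero.
row = demo_vectorizer.transform(["payment to bank_zzz"])
print(row.nnz, row.shape)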
(I also posted this on scikit-learn's GitHub: https://github.com/scikit-learn/scikit-learn/issues/21906)
Upvotes: 1
Views: 208
Reputation: 40149
X_text_train & X_text_test have shapes (2M, 2) and (100k, 2) respectively, and after the TF-IDF their shapes are (2M, 50k) and (100k, 50k) respectively.
This I do not understand: in scikit-learn, text vectorizers do not accept 2D inputs. They expect an iterable of str objects, so it's not possible for X_text_train to have a shape other than (n_documents,).
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
This does not make any sense to me: np.array([["a", "b"], ["c", "d"]], dtype=object).ravel() will return array(['a', 'b', 'c', 'd'], dtype=object), so this would generate 2 rows per original row in X_text_train.
Maybe you wanted to do something like the following?
X_concat_text_train = [x[0] + " " + x[1] for x in X_text_train]
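If X_text_train is a 2D NumPy array with dtype=object, as the shapes suggest (an assumption on my part), a vectorized equivalent of that list comprehension would be:
import numpy as np
# Hypothetical toy array standing in for the real data; with dtype=object,
# elementwise "+" concatenates the two string columns row by row.
X_text_train = np.array([["payment received", "bank_a"],
                         ["wire transfer", "bank_b"]], dtype=object)
X_concat_text_train = X_text_train[:, 0] + " " + X_text_train[:, 1]
print(X_concat_text_train)  # ['payment received bank_a' 'wire transfer bank_b']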
Why does this happen with SGDClassifier?
It's not really possible to answer your question precisely without having access to a minimal reproducible example with either minimal synthetic data or publicly available data.
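To make that concrete, such a reproducible example could be as small as the following sketch (fully synthetic data, invented here purely to show the expected structure):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Tiny synthetic dataset: description + bank name merged into one string.
texts = [f"card payment shop_{i % 50} bank_{'a' if i % 2 else 'b'}" for i in range(1000)]
y = [i % 2 for i in range(1000)]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4))
X = vec.fit_transform(texts)
clf = SGDClassifier(loss="log", random_state=0).fit(X, y)  # "log" was renamed to "log_loss" in newer scikit-learn

# Same description with a seen vs. an unseen bank name:
print(clf.predict_proba(vec.transform(["card payment shop_3 bank_a"])))
print(clf.predict_proba(vec.transform(["card payment shop_3 bank_zzz"])))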
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
You can answer this question yourself by replacing SGDClassifier with LogisticRegression, which uses the non-stochastic LBFGS solver by default.
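For instance, a sketch of the swap (note that LBFGS only supports an L2 penalty, so the elastic-net settings from your SGDClassifier have no direct equivalent here):
from sklearn.linear_model import LogisticRegression

# lbfgs is deterministic: if the "random" predictions for unseen bank names
# disappear with it, the stochastic optimization of SGD is a likely suspect.
lr_classifier = LogisticRegression(solver="lbfgs", class_weight="balanced", max_iter=1000)
lr_classifier.fit(X_train, y_train)
y_predicted = lr_classifier.predict(X_test)
y_predicted_prob = lr_classifier.predict_proba(X_test)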
Upvotes: 0