William

Reputation: 141

Classifying an NLP solution above a confidence threshold

I have a dataframe, all_data, which contains two columns: Event_Summary, a textual description of the event, and Impact, the classification. I have used an SVM to classify this data automatically; see the code below:

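import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split

train, test = train_test_split(all_data, test_size=0.2)

# Build the vocabulary from the training text only
count_vect = CountVectorizer(stop_words='english', analyzer="word")
X_train_counts = count_vect.fit_transform(train.Event_Summary)

# Term-frequency normalisation (no IDF weighting)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

y_train = train["Impact"]
y_test = test["Impact"]

# Reuse the transformers fitted on the training data; do not refit on the test set
X_test_counts = count_vect.transform(test.Event_Summary)
X_test_tf = tf_transformer.transform(X_test_counts)

# Grid search over kernel and C with 5-fold cross-validation
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC(gamma="scale")
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train_tf, y_train)

# Accuracy on the held-out test set, as a percentage
pred = clf.predict(X_test_tf)
score = np.mean(pred == y_test) * 100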

The score comes out at about 70%, which is pretty low considering there are just two categories. Because of this low score, I would like to classify a description only when the algorithm's confidence in its categorisation is above a certain threshold (leaving the uncertain ones for me to fill in manually).

Is this possible with Python / sklearn, and if so, does anyone have advice on how to do it? Also, does anyone have a recommendation for how I can make my model more accurate?

Upvotes: 1

Views: 155

Answers (1)

rishi

Reputation: 2554

You can't know what the confidence score is until you run the classifier. So run the classifier, look at the score for each prediction, and then decide which ones get accepted automatically and which ones go through manual review.

Regarding your second question, on how to improve accuracy, there are a couple of things you can do.

  1. Try using more sophisticated techniques like word embeddings to vectorize your data. This often gives better results than plain word counts.
  2. Try using different classifiers to see which one gives you best results.
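As a rough sketch of point 2, the snippet below scores a few candidate classifiers with cross-validation on the same TF-IDF features. The helper name compare_classifiers and the particular candidates (LinearSVC, LogisticRegression, MultinomialNB) are illustrative assumptions, not something taken from the question; swap in whichever estimators you want to try.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, labels, cv=5):
    """Return {classifier name: mean cross-validated accuracy}.

    Hypothetical helper: the candidate list is only an example.
    """
    candidates = {
        "linear_svc": LinearSVC(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, estimator in candidates.items():
        # Putting the vectorizer in the pipeline means it is refitted on
        # each training fold, so no test-fold text leaks into the vocabulary.
        pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), estimator)
        scores[name] = cross_val_score(pipeline, texts, labels, cv=cv).mean()
    return scores
```

With the dataframe from the question, this could be called as compare_classifiers(all_data["Event_Summary"], all_data["Impact"]).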

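Also, have a look at the predict_proba function to get a confidence value. Note that svm.SVC only provides predict_proba when it is constructed with probability=True.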
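A minimal sketch of the accept-or-review idea, assuming a fitted model that exposes predict_proba and classes_ (for example a pipeline ending in svm.SVC(probability=True)). The helper name split_by_confidence, the 0.8 threshold, and the toy texts are assumptions for illustration; pick the threshold by checking accuracy on held-out data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def split_by_confidence(model, texts, threshold=0.8):
    """Split predictions into auto-accepted labels and indices to review.

    Returns (auto_labels, needs_review): auto_labels maps row index to the
    predicted label when the top class probability is >= threshold;
    needs_review lists the remaining row indices.
    """
    proba = model.predict_proba(texts)
    top_proba = proba.max(axis=1)
    # Columns of predict_proba follow the order of model.classes_
    labels = model.classes_[proba.argmax(axis=1)]
    confident = top_proba >= threshold
    auto_labels = {int(i): labels[i] for i in np.flatnonzero(confident)}
    needs_review = [int(i) for i in np.flatnonzero(~confident)]
    return auto_labels, needs_review


if __name__ == "__main__":
    # Toy data, made up for this example only
    train_texts = [
        "major outage in the data centre", "server crash with data loss",
        "critical failure of payment system", "major network outage",
        "database crash during peak hours",
        "minor typo on the help page", "cosmetic change to the logo",
        "small label fix in the footer", "minor colour tweak",
        "typo fixed in the footer text",
    ]
    train_labels = ["High"] * 5 + ["Low"] * 5
    model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", probability=True, random_state=0))
    model.fit(train_texts, train_labels)
    auto, review = split_by_confidence(model, ["server outage", "logo typo"], threshold=0.8)
    print("auto-labelled:", auto)
    print("needs manual review:", review)
```

The probabilities from SVC(probability=True) come from an internal calibration step, so on small datasets they can be rough; treat the threshold as something to tune rather than a guarantee.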

Upvotes: 1
