Reputation: 141
I have a dataframe all_data with two columns: Event_Summary, a textual description of the event, and Impact, which is the classification. I have used an SVM to classify this data automatically - see the code below:
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split

# Split into 80% train / 20% test
train, test = train_test_split(all_data, test_size=0.2)

# Bag-of-words counts, then term-frequency scaling (idf disabled)
count_vect = CountVectorizer(stop_words='english', analyzer="word")
X_train_counts = count_vect.fit_transform(train.Event_Summary)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
y_train = train["Impact"]
y_test = test["Impact"]
# Transform the test text with the vectorizer and transformer fitted on the training data
X_test_counts = count_vect.transform(test.Event_Summary)
X_test_tf = tf_transformer.transform(X_test_counts)
# Grid-search an SVM over kernel and C
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC(gamma="scale")
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train_tf, y_train)
pred = clf.predict(X_test_tf)
score = np.mean(pred == y_test)*100
The score comes out to be about 70%, which is pretty low considering there are only two categories. Because of this, I would like the algorithm to classify a description only when it is above a certain confidence threshold that the categorisation is correct, leaving the uncertain ones for me to fill in manually.
Is this possible with Python / scikit-learn, and if so, does anyone have advice on how to do it? Also, does anyone have a recommendation on how I can make my model more accurate?
Upvotes: 1
Views: 155
Reputation: 2554
You won't know what the confidence score is until you run the classifier. So run the classifier, look at the score for each prediction, and then decide which predictions get accepted automatically and which ones go through manual review.
Regarding your second question on how to improve accuracy, there are a couple of things you can do.
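For example (a rough sketch of common adjustments rather than a fixed recipe): you could let the grid search also try idf weighting, word bigrams and a wider range of C, with the vectorizer and the SVM cross-validated together in one Pipeline. The column names below come from your question; the grid values are only illustrative.
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Vectorizer + SVM in one pipeline so both are tuned in the same CV loop
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svc', svm.SVC(gamma='scale')),
])

# Illustrative grid: idf on/off, unigrams vs. unigrams+bigrams, wider C range
param_grid = {
    'tfidf__use_idf': [True, False],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(train.Event_Summary, train["Impact"])
print(search.best_params_, search.best_score_)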
Also, you may want to look at the predict_proba function to get confidence scores.
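Here is a minimal sketch of what that could look like, reusing the names from your code (X_train_tf, y_train, X_test_tf). Note that SVC only exposes predict_proba when it is created with probability=True, and the 0.8 threshold is just an arbitrary example value, not a recommendation.
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# probability=True is required for predict_proba on SVC (it adds a probability-calibration step)
svc = svm.SVC(gamma="scale", probability=True)
clf = GridSearchCV(svc, {'kernel': ('linear', 'rbf'), 'C': [1, 10]}, cv=5)
clf.fit(X_train_tf, y_train)

proba = clf.predict_proba(X_test_tf)        # shape (n_samples, n_classes)
confidence = proba.max(axis=1)              # highest class probability per row
pred = clf.classes_[proba.argmax(axis=1)]   # label matching that probability

threshold = 0.8                             # example cut-off, tune to taste
auto = confidence >= threshold              # accept these predictions automatically
manual = ~auto                              # send the rest for manual review

print(auto.sum(), "auto-classified,", manual.sum(), "left for manual review")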
Upvotes: 1