inverted_index

Reputation: 2437

Computing classification metrics for sequence labelling task

I want to calculate accuracy/precision/recall/F1 measures for a sentence classification task. I have previously computed these for whole-text classification, which is straightforward, but I am confused about how to do it for sentence classification, since predictions are made at the sentence level rather than at the text level. Note that a text might contain several sentences... Here is an example:

Suppose we have the following text, with predicted labels in []:

Seq2seq networks are a good way of learning sequences. [0] They perform reasonably fine at generating long sequences. [1] These networks are utilized in downstream tasks such as NMT and text summarization [0]. blah blah blah [2]

So the prediction is [0, 1, 0, 2] and suppose the gold labels for the sentences above are: [1, 1, 0, 0].

So is the accuracy of this equal to correct / total = (1 + 1) / 4 = 0.5? What about other metrics such as Precision, Recall, and F1? Any ideas?

Upvotes: 0

Views: 1267

Answers (3)

inverted_index

Reputation: 2437

While looking for a solution to this, I drew inspiration from a related task (namely NER) and from the definitions of Precision and Recall; once those two are computed, the F1 score follows easily.

By definition:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 · Precision · Recall / (Precision + Recall)

I noticed that all I need is to compute TP, FP, and FN. For example, for the predictions [0, 0, 1, 1] with true labels [0, 0, 1, 0] (taking 1 as the positive class), TP is 1, FP is 1, and FN is 0. Thus:

Precision = 1 / (1 + 1) = 0.5

Recall = 1 / (1 + 0) = 1.0

F1 = 2 · 0.5 · 1.0 / (0.5 + 1.0) ≈ 0.67
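The counting above can be sketched in a few lines. This is a minimal illustration, not a library implementation: it treats label 1 as the positive class and tallies TP/FP/FN pairwise over the two sequences.

```python
# Sentence-level gold labels and predictions from the example above.
y_true = [0, 0, 1, 0]
y_pred = [0, 0, 1, 1]

# Count true positives, false positives, and false negatives for class 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                          # 1 / 2 = 0.5
recall = tp / (tp + fn)                             # 1 / 1 = 1.0
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.667

print(tp, fp, fn, precision, recall, round(f1, 3))
```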

Here, since the model's performance on the positive class matters most to me, I compute these metrics for the positive class only. I also realized that this is just the standard use of the F1 metric; only the level of granularity differs from task to task. Hope this helps anyone who has been puzzled by this issue.

Upvotes: 1

SidharthMacherla

Reputation: 400

The questioner is asking for suggestions on how to measure model performance, rather than for a programmatic solution in a particular language or library. Below are some questions to think about, along with a suggested approach.

Before attempting to answer the question, let us ask ourselves the following; the answers will point to the best approach.

  1. Is the classification model a bag-of-words type of model, where the sequence does not matter, only the words in a given sentence? If so, the model can only be measured on what it was built for. In that case, the number of sentences classified correctly divided by the total number of sentences is a good measure of accuracy.
  2. If the classification model is a graph-based model such as a Hidden Markov Model or Conditional Random Fields, then the question to ask is whether multiple sentences are considered as input before classifying the current sentence. If the answer is yes, then one is better off measuring model performance over the entire document: something along the lines of the number of documents classified correctly divided by the total number of documents.
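Point 2 can be sketched as follows. The nested lists are illustrative (not from the question): a document counts as correct only if every sentence in it is labelled correctly.

```python
# Per-document sentence labels: gold and predicted.
docs_true = [[1, 1, 0, 0], [0, 2]]
docs_pred = [[0, 1, 0, 2], [0, 2]]

# A document is correct only when its whole label sequence matches.
correct_docs = sum(1 for t, p in zip(docs_true, docs_pred) if t == p)
doc_accuracy = correct_docs / len(docs_true)
print(doc_accuracy)  # 1 of 2 documents fully correct -> 0.5
```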

As a final note, whether precision, recall, or accuracy is the best measure depends on the trade-off one wishes to make, and the author will not comment on that.

Upvotes: -1

arpitrathi

Reputation: 177

For multi-class classification, you can get Precision, Recall, and F1 score using metrics.classification_report(). It reports these metrics for each individual class, as well as their 'macro', 'micro', 'weighted', and 'samples' averages.

from sklearn import metrics

# True values
y_true = [1,1,0,0]
# Predicted values
y_pred = [0,1,0,2]

# Print the confusion matrix
print(metrics.confusion_matrix(y_true, y_pred))

# Print the precision and recall, among other metrics
print(metrics.classification_report(y_true, y_pred))
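If you need the numbers programmatically rather than as printed text, classification_report can also return a dict (the output_dict flag requires scikit-learn >= 0.20, and zero_division >= 0.22). A small sketch using the same labels:

```python
from sklearn import metrics

# True and predicted values from the question.
y_true = [1, 1, 0, 0]
y_pred = [0, 1, 0, 2]

# output_dict=True returns nested dicts instead of a formatted string;
# zero_division=0 silences warnings for classes never predicted or never seen.
report = metrics.classification_report(
    y_true, y_pred, output_dict=True, zero_division=0
)

print(report["accuracy"])               # 2 of 4 sentences correct -> 0.5
print(report["1"]["precision"])         # per-class metrics, keyed by label as a string
print(report["macro avg"]["f1-score"])  # unweighted mean over classes
```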

Upvotes: 1
