pesolari

Reputation: 105

How to take just the score from HuggingFace Pipeline Sentiment Analysis

I'm quite new to the whole HuggingFace pipeline world, and I have stumbled upon something which I can't figure out. I have googled quite a bit for an answer, but haven't found anything yet, so any help would be great. I am trying to get just the score from the HF pipeline sentiment classifier, not the label, as I want to apply the scores to a dataframe containing many cells of text. I know how to achieve this on just a single sentence, namely like so:

from transformers import pipeline
classifier = pipeline("sentiment-analysis")

result = classifier("This is a positive sentence")[0]
print(result['score'])

This gives me the following output:

0.9994597434997559

I know how to apply the classifier to my dataframe. However, when I adapt the code above to the dataframe, like so:

result = df['text'].apply(lambda x: classifier(x[:512]))[0]
df['sentiment'] = result['score']

My code fails on the second line, with the following error:

TypeError: list indices must be integers or slices, not str

Does anyone know how to fix this? I have tried a few things, but I haven't been able to figure it out so far. Any help would be immensely appreciated!

Upvotes: 3

Views: 1831

Answers (2)

R Chang

Reputation: 63

If your classifier output looks like this:

[{'label': '1', 'score': 0.9999555349349976}]

then you could extract the score with the following:

df['sentiment'] = df['text'].apply(
  lambda x: classifier(x[:512])).str[0].str['score']

Alternatively:

Get the classifier output:

df['result'] = df['text'].apply(lambda x: classifier(x[:512]))

Extract the score from the output:

df['sentiment'] = df['result'].str[0].str['score']

Upvotes: 1

NaN

Reputation: 3591

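The main issue is that the [0] at the end of the first line is applied to the Series returned by apply (so it selects the first row) instead of to each classifier output. It should be inside the outermost bracket so that it is part of your lambda function; the 'score' lookup can then also happen inside the lambda.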
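A minimal sketch of that fix. To keep it runnable without downloading a model, it uses `fake_classifier`, a hypothetical stand-in that only mimics the output shape of the pipeline (a list containing one dict); with the real pipeline you would call `classifier` in its place.

```python
import pandas as pd


def fake_classifier(text):
    """Hypothetical stand-in mimicking the pipeline output: a list with one dict."""
    if "good" in text:
        return [{"label": "POSITIVE", "score": 0.99}]
    return [{"label": "NEGATIVE", "score": 0.75}]


df = pd.DataFrame({"text": ["a good movie", "a dull movie"]})

# Index into the list ([0]) and the dict (['score']) inside the lambda,
# so that each row yields a single float rather than a list.
df["sentiment"] = df["text"].apply(lambda x: fake_classifier(x[:512])[0]["score"])
```

Note that without the label, this score is the score of whichever label was predicted, which is why the next point matters.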

Moreover, the label and the score carry redundant information (the label is chosen based on the score), and the negative and positive scores are complementary (pos = 1 - neg). Consequently, it is sufficient to extract the score of one label (either positive or negative). This can be done by turning on the return_all_scores flag when constructing or calling the pipeline.

classifier = pipeline("sentiment-analysis", return_all_scores=True)

Afterwards, extracting the positive (or negative) score is straightforward:

df['pos_score'] = df['text'].apply(
  lambda x: classifier(x[:512])[0][1].get('score')
)
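The `[1]` above relies on the positive label being second in the returned list. A slightly more defensive sketch looks the score up by label name instead. `score_for_label` is a hypothetical helper, and the label names "POSITIVE" and "NEGATIVE" are an assumption about the model in use; `classifier.model.config.id2label` shows the actual names.

```python
def score_for_label(all_scores, label):
    """Return the score of the entry whose 'label' matches, from a list of label/score dicts."""
    for entry in all_scores:
        if entry["label"] == label:
            return entry["score"]
    raise KeyError(f"label {label!r} not found")


# Assumed shape of the per-text output when all scores are returned.
example = [
    {"label": "NEGATIVE", "score": 0.02},
    {"label": "POSITIVE", "score": 0.98},
]
pos = score_for_label(example, "POSITIVE")
```

Inside the lambda this would read `score_for_label(classifier(x[:512])[0], "POSITIVE")`, assuming the nested-list output that the snippet above also assumes.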

Also keep in mind that this is a sequential operation (not batched) and therefore slow. Finally, I would not recommend interpreting or using the score directly. This is a classifier, so the “propensity” or “certainty” score is normally not calibrated, and neural networks are known to be overconfident (i.e., your scores will cluster near 0 and 1).
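One way to avoid the row-by-row call is to hand the pipeline the whole list of texts at once. The sketch below assumes a classifier built without return_all_scores, so that a list of strings yields one label/score dict per text. `score_all` is a hypothetical helper; `batch_size` and `truncation` are keyword arguments forwarded to the pipeline call, and `truncation=True` cuts inputs at the model's token limit rather than at 512 characters. Whether batching is actually faster depends on the hardware and the text lengths.

```python
def score_all(classifier, texts, batch_size=16):
    """Score every text in one pipeline call and return the predicted-label scores."""
    # Assumed output: one {'label': ..., 'score': ...} dict per input text.
    outputs = classifier(list(texts), batch_size=batch_size, truncation=True)
    return [out["score"] for out in outputs]


# Usage with the real pipeline and dataframe from the question:
# df["sentiment"] = score_all(classifier, df["text"].tolist())
```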

Upvotes: 0
