Alapan Kuila

Reputation: 159

Calculate BLEU score in Python

There is a test sentence and a reference sentence. How can I write a Python script that measures the similarity between these two sentences in the form of the BLEU metric used in automatic machine translation evaluation?

Upvotes: 15

Views: 66695

Answers (6)

ZKS

Reputation: 2836

If anyone is using TensorFlow: you have to calculate y_true and y_pred.

Example:

ENGLISH INPUT (y_true is some vector encoding the following sentence): I loved the movie very much.

FRENCH OUTPUT (y_pred is some vector encoding the following sentence; you can use tf.argmax() to get the highest-probability tokens): j'ai beaucoup aimé le film

class BLEU(tf.keras.metrics.Metric):

    def __init__(self, name='bleu_score'):
        super(BLEU, self).__init__(name=name)
        self.bleu_score = 0

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Collapse logits to token ids by taking the highest-probability class.
        y_pred = tf.argmax(y_pred, -1)
        self.bleu_score = 0
        for i, j in zip(y_pred, y_true):
            # No-op loop hint for AutoGraph; only valid when this method is
            # traced inside a tf.function.
            tf.autograph.experimental.set_loop_options()

            total_words = tf.math.count_nonzero(i)
            total_matches = 0
            for word in i:
                if word == 0:  # 0 is assumed to be the padding token
                    break
                for q in range(len(j)):
                    if j[q] == 0:
                        break
                    if word == j[q]:
                        total_matches += 1
                        # Drop the matched reference token so it cannot match twice.
                        j = tf.boolean_mask(j, [False if y == q else True for y in range(len(j))])
                        break

            # Accumulates a clipped unigram precision per sentence (BLEU-1
            # without the brevity penalty), not full 4-gram BLEU.
            self.bleu_score += total_matches / total_words

    def result(self):
        # BATCH_SIZE is assumed to be defined elsewhere in the training script.
        return self.bleu_score / BATCH_SIZE

Upvotes: 0

ccy

Reputation: 1863

The BLEU score consists of two parts, modified precision and brevity penalty. Details can be seen in the paper. You can use the nltk.translate.bleu_score module inside NLTK. One code example can be seen below:

import nltk

hypothesis = ['It', 'is', 'a', 'cat', 'at', 'room']
reference = ['It', 'is', 'a', 'cat', 'inside', 'the', 'room']
#there may be several references
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis)
print(BLEUscore)
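
For reference, the formula from the original paper combines the modified n-gram precisions p_n, weighted by w_n, with a brevity penalty BP, where c and r are the candidate and reference lengths:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}

With NLTK's defaults, N = 4 and w_n = 1/4 for each n.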

Note that the default BLEU score uses n=4, which includes unigrams up to 4-grams. If your sentence is shorter than 4 words, you need to reset the weights; otherwise a ZeroDivisionError: Fraction(0, 0) error will be raised. So, you should reset the weights like this:

import nltk

hypothesis = ["open", "the", "file"]
reference = ["open", "file"]
# the maximum order is bigram, so split the weight equally in two
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights = (0.5, 0.5))
print(BLEUscore)
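
Alternatively, NLTK provides a SmoothingFunction to handle zero n-gram counts without changing the weights; a minimal sketch:

import nltk
from nltk.translate.bleu_score import SmoothingFunction

hypothesis = ["open", "the", "file"]
reference = ["open", "file"]
# method1 adds a small epsilon to zero counts; the class docstring lists
# the other smoothing methods.
smoother = SmoothingFunction()
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, smoothing_function=smoother.method1)
print(BLEUscore)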

Upvotes: 23

Aryan Singh

Reputation: 11

Here are some examples of how to calculate the BLEU score when the test and reference sentences are known.

You can even take both sentences as input in the form of strings and convert them to lists (see the sketch after the examples).

from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']]
test = ['on', 'the', 'mat', 'is', 'a', 'cat']
score = sentence_bleu(reference, test)
print(score)


from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']]
test = ['there', 'is', 'cat', 'sitting', 'cat']
score = sentence_bleu(reference, test)
print(score)
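
And a minimal sketch of the string-input variant mentioned above, tokenizing with str.split() (a plain whitespace split; a real tokenizer may be preferable):

from nltk.translate.bleu_score import sentence_bleu

reference_text = 'the cat is sitting on the mat'
test_text = 'on the mat is a cat'

# split() turns each string into the token list sentence_bleu expects
score = sentence_bleu([reference_text.split()], test_text.split())
print(score)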

Upvotes: 0

Ameet Deshpande

Reputation: 536

The following is code for calculating the BLEU score between two files.

from nltk.translate.bleu_score import sentence_bleu
import argparse

def argparser():
    Argparser = argparse.ArgumentParser()
    Argparser.add_argument('--reference', type=str, default='summaries.txt', help='Reference File')
    Argparser.add_argument('--candidate', type=str, default='candidates.txt', help='Candidate file')

    args = Argparser.parse_args()
    return args

args = argparser()

with open(args.reference, 'r') as f:
    reference = f.readlines()
with open(args.candidate, 'r') as f:
    candidate = f.readlines()

if len(reference) != len(candidate):
    raise ValueError('The number of sentences in the two files does not match.')

score = 0.

for i in range(len(reference)):
    score += sentence_bleu([reference[i].strip().split()], candidate[i].strip().split())

score /= len(reference)
print("The bleu score is: "+str(score))

Use the command python file_name.py --reference file1.txt --candidate file2.txt
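
As a side note, averaging sentence_bleu over lines is not the same as corpus-level BLEU, which pools n-gram counts across all sentence pairs before combining them. If corpus-level BLEU is wanted instead, a minimal sketch under the same line-aligned file format (file names are just examples):

from nltk.translate.bleu_score import corpus_bleu

with open('summaries.txt') as f:
    references = [[line.strip().split()] for line in f]
with open('candidates.txt') as f:
    candidates = [line.strip().split() for line in f]

# corpus_bleu expects, for each sentence, a list of reference token lists
print(corpus_bleu(references, candidates))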

Upvotes: 4

Franck Dernoncourt

Reputation: 83177

You may want to use the Python package SacréBLEU (Python 3 only):

SacréBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.

Why use this version of BLEU?

  • It automatically downloads common WMT test sets and processes them to plain text
  • It produces a short version string that facilitates cross-paper comparisons
  • It properly computes scores on detokenized outputs, using WMT (Conference on Machine Translation) standard tokenization
  • It produces the same values as the official script (mteval-v13a.pl) used by WMT
  • It outputs the BLEU score without the comma, so you don't have to remove it with sed (Looking at you, multi-bleu.perl)

To install: pip install sacrebleu
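
Once installed, the Python API is straightforward; a minimal sketch (the sentences are made up):

import sacrebleu

hypotheses = ['The cat sat on the mat.']
references = [['The cat is sitting on the mat.']]  # one line-aligned reference stream

# sacrebleu tokenizes internally (WMT-standard tokenization by default)
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)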

Upvotes: 5

Semih Yagcioglu

Reputation: 4101

You are actually asking for two different things. I will try to shed light on each of the questions.

Part I: Computing the BLEU score

You can calculate the BLEU score using the BLEU module under nltk. See here.

From there you can easily compute the score between the candidate and reference sentences.

Part II: Computing the similarity

I would not suggest using the BLEU score as a similarity measure between two candidate sentences, even if both are scored against the same reference.

Let me elaborate. If you calculate a BLEU score for a candidate against a reference, that score only lets you compare it with another candidate's BLEU score against the same reference; it does not tell you how similar the two candidates are to each other.

If you intend to measure the similarity between two sentences, word2vec would be a better method. You can compute the cosine similarity between the two sentence vectors to gauge how close they are.
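
A minimal sketch of that idea, assuming pretrained word2vec vectors loaded via gensim (the vector file path is a placeholder):

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vector file works here.
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

def sentence_vector(tokens):
    # Average the vectors of in-vocabulary words.
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0)

a = sentence_vector('the cat sits on the mat'.split())
b = sentence_vector('a cat is sitting on the mat'.split())
# Cosine similarity between the two averaged sentence vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))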

For a thorough understanding of what BLEU metric does, I'd suggest reading this as well as this for word2vec similarity.

Upvotes: 12
