Reputation: 159
There is a test sentence and a reference sentence. How can I write a Python script that measures the similarity between these two sentences in the form of the BLEU metric used in automatic machine translation evaluation?
Upvotes: 15
Views: 66695
Reputation: 2836
If anyone is using TensorFlow, you have to compute the metric from y_true and y_pred.
Example:
ENGLISH INPUT (y_true, some vector for the following sentence) - I loved the movie very much.
FRENCH OUTPUT (y_pred, some vector for the following sentence; you can use tf.argmax() to get the highest-probability token) - j'ai beaucoup aimé le film
class BLEU(tf.keras.metrics.Metric):
    def __init__(self, name='bleu_score'):
        super(BLEU, self).__init__(name=name)
        self.bleu_score = 0

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Turn the predicted probability distributions into token ids.
        y_pred = tf.argmax(y_pred, -1)
        self.bleu_score = 0
        for i, j in zip(y_pred, y_true):
            tf.autograph.experimental.set_loop_options()
            total_words = tf.math.count_nonzero(i)
            total_matches = 0
            for word in i:
                if word == 0:  # padding marks the end of the predicted sentence
                    break
                for q in range(len(j)):
                    if j[q] == 0:  # padding marks the end of the reference sentence
                        break
                    if word == j[q]:
                        total_matches += 1
                        # Drop the matched reference token so it cannot be matched twice.
                        j = tf.boolean_mask(j, [False if y == q else True for y in range(len(j))])
                        break
            self.bleu_score += total_matches / total_words

    def result(self):
        # BATCH_SIZE is defined elsewhere in the training script.
        return self.bleu_score / BATCH_SIZE
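Note that, as written, this metric only accumulates clipped unigram matches between the prediction and the reference; it does not compute the higher-order n-gram precisions or the brevity penalty of the full BLEU score.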
Upvotes: 0
Reputation: 1863
The BLEU score consists of two parts, modified precision and a brevity penalty.
Details can be seen in the paper.
You can use the nltk.translate.bleu_score module inside NLTK.
A code example is shown below:
import nltk
hypothesis = ['It', 'is', 'a', 'cat', 'at', 'room']
reference = ['It', 'is', 'a', 'cat', 'inside', 'the', 'room']
#there may be several references
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis)
print(BLEUscore)
Note that the default BLEU score uses n=4, which includes unigrams up to 4-grams. If your sentence has fewer than 4 words, you need to reset the n-gram weights, otherwise a ZeroDivisionError: Fraction(0, 0) error will be raised.
So you should reset the weights like this:
import nltk
hypothesis = ["open", "the", "file"]
reference = ["open", "file"]
#the maximum n-gram order is a bigram, so split the weight into two halves
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights = (0.5, 0.5))
print(BLEUscore)
Upvotes: 23
Reputation: 11
Here are some examples of how to calculate the BLEU score when the test and reference sentences are known.
You can even take both sentences as input strings and convert them to lists.
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat',"is","sitting","on","the","mat"]]
test = ["on",'the',"mat","is","a","cat"]
score = sentence_bleu( reference, test)
print(score)
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'cat',"is","sitting","on","the","mat"]]
test = ["there",'is',"cat","sitting","cat"]
score = sentence_bleu( reference, test)
print(score)
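For hypotheses like these, which share few higher-order n-grams with the reference, NLTK emits a warning and the sentence-level score collapses toward zero. One option (not shown in the examples above) is to pass one of NLTK's smoothing functions:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']]
test = ['there', 'is', 'cat', 'sitting', 'cat']
# method1 adds a small epsilon to zero n-gram counts instead of letting them zero out the score.
smoothie = SmoothingFunction().method1
score = sentence_bleu(reference, test, smoothing_function=smoothie)
print(score)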
Upvotes: 0
Reputation: 536
The following is code for calculating the BLEU
score between two files.
from nltk.translate.bleu_score import sentence_bleu
import argparse

def argparser():
    Argparser = argparse.ArgumentParser()
    Argparser.add_argument('--reference', type=str, default='summaries.txt', help='Reference File')
    Argparser.add_argument('--candidate', type=str, default='candidates.txt', help='Candidate file')
    args = Argparser.parse_args()
    return args

args = argparser()

reference = open(args.reference, 'r').readlines()
candidate = open(args.candidate, 'r').readlines()

if len(reference) != len(candidate):
    raise ValueError('The number of sentences in the two files does not match.')

# Average the sentence-level BLEU scores over all line pairs.
score = 0.
for i in range(len(reference)):
    score += sentence_bleu([reference[i].strip().split()], candidate[i].strip().split())

score /= len(reference)
print("The bleu score is: " + str(score))
Use the command python file_name.py --reference file1.txt --candidate file2.txt
Upvotes: 4
Reputation: 83177
You may want to use the python package SacréBLEU (Python 3 only):
SacréBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
Why use this version of BLEU?
- It automatically downloads common WMT test sets and processes them to plain text
- It produces a short version string that facilitates cross-paper comparisons
- It properly computes scores on detokenized outputs, using WMT (Conference on Machine Translation) standard tokenization
- It produces the same values as the official script (mteval-v13a.pl) used by WMT
- It outputs the BLEU score without the comma, so you don't have to remove it with sed (looking at you, multi-bleu.perl)
To install: pip install sacrebleu
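A minimal usage sketch (not part of the quoted description above), assuming sacrebleu is installed; corpus_bleu expects detokenized plain-text hypotheses and a list of reference streams:
import sacrebleu

hypotheses = ['The cat is sitting on the mat.']
references = [['The cat sat on the mat.']]  # one stream of references, aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)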
Upvotes: 5
Reputation: 4101
You are actually asking for two different things. I will try to shed light on each of the questions.
Part I: Computing the BLEU score
You can calculate the BLEU score using the BLEU module under nltk
. See here.
From there you can easily compute the score between the candidate and reference sentences.
Part II: Computing the similarity
I would not suggest using the BLEU score as a similarity measure between two candidate sentences, even when both are scored against the same reference.
Let me elaborate. If you calculate a BLEU score for a candidate against a reference, that score only lets you compare it with another candidate's BLEU score against the same reference sentence; it does not tell you how similar the two candidates are to each other.
If you intend to measure the similarity between two sentences, word2vec would be a better method. You can compute the angular cosine distance between the two sentence vectors to understand their similarity.
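A minimal sketch of that idea (not from the original answer), assuming gensim is installed and a pre-trained word2vec model in the standard binary format is available at the hypothetical path vectors.bin; sentence vectors are taken as the average of the word vectors:
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # hypothetical path

def sentence_vector(tokens):
    # Average the vectors of the in-vocabulary tokens.
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector(['it', 'is', 'a', 'cat', 'at', 'room'])
s2 = sentence_vector(['it', 'is', 'a', 'cat', 'inside', 'the', 'room'])
print(cosine_similarity(s1, s2))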
For a thorough understanding of what the BLEU metric does, I'd suggest reading this, as well as this for word2vec similarity.
Upvotes: 12