Technology and Theory: BLEU Evaluation - 藤君的小窝

BLEU (Bilingual Evaluation Understudy)

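Beginners tend to run into several questions:

With multiple reference translations, how exactly is the score computed?
If the same n-gram occurs several times, how is it counted?
What does the code below actually compute?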

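from nltk.translate.bleu_score import corpus_bleu
from tqdm import tqdm  # progress bar used in the loop below

def evaluate_bleu(model, test_loader):
    # translate(), PAD_IDX, BOS_IDX and EOS_IDX come from the surrounding project.
    references = []  # one list of reference translations per source sentence
    hypotheses = []  # one hypothesis per source sentence
    for src, tgt in tqdm(test_loader, desc="Evaluating BLEU"):
        pred_indices = translate(model, src)
        tgt_indices = tgt[0].tolist()  # only the first sequence of each batch is used
        # Drop special tokens; the target is wrapped in a list because it is the only reference.
        tgt_splits = [[t for t in tgt_indices if t not in [PAD_IDX, BOS_IDX, EOS_IDX]]]
        pred_splits = [t for t in pred_indices if t not in [PAD_IDX, BOS_IDX, EOS_IDX]]
        references.append(tgt_splits)
        hypotheses.append(pred_splits)
    score = corpus_bleu(references, hypotheses)
    print(f"BLEU@4 Score: {score * 100:.2f}")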

This post aims to give a precise reading of BLEU, checked against the NLTK implementation.

$\text{BLEU} = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$

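where:

N is the maximum n-gram order; in practice N = 4.
$w_n = \frac{1}{N}$ is the weight of each order.
$p_n$ is the modified $n$-gram precision: the number of $n$-grams in the hypothesis that also appear in a reference, divided by the total number of $n$-grams in the hypothesis.

(Hypothesis: the model's translation. Reference: the gold translation.)

[One open question in this ratio is how repeated $n$-grams are counted; see the analysis later in this post.]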

Brevity Penalty (BP): the penalty factor for overly short sentences

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$

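where:

c is the number of words in the hypothesis.

r is the reference length closest to c, taken over the lengths of all reference translations.

Note that with multiple references, r is not the most favorable length but the closest one. For example: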

>>> from nltk.translate.bleu_score import closest_ref_length, brevity_penalty
>>> references = [['a'] * 13, ['a'] * 2]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
0.9200...

If two reference lengths are equally close, the shorter one is used. Since it is then below c, no penalty applies.

>>> from nltk.translate.bleu_score import closest_ref_length, brevity_penalty
>>> references = [['a'] * 13, ['a'] * 11]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len = closest_ref_length(references, hyp_len)
>>> bp1 = brevity_penalty(closest_ref_len, hyp_len)
>>> # The order of the references does not change the result.
>>> closest_ref_len = closest_ref_length(reversed(references), hyp_len)
>>> bp2 = brevity_penalty(closest_ref_len, hyp_len)
>>> bp1 == bp2 == 1
True
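The two rules above (closest length, ties go to the shorter reference) can be sketched in plain Python. This is a minimal re-implementation for illustration, not NLTK's code; the function names `closest_ref_length_sketch` and `brevity_penalty_sketch` are made up here, and returning 0 for an empty hypothesis is an assumption of the sketch.

```python
import math

def closest_ref_length_sketch(ref_lengths, hyp_len):
    """Pick the reference length closest to hyp_len; ties go to the shorter one."""
    return min(ref_lengths, key=lambda r: (abs(r - hyp_len), r))

def brevity_penalty_sketch(ref_len, hyp_len):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if hyp_len > ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0  # assumption of this sketch: an empty hypothesis scores 0
    return math.exp(1 - ref_len / hyp_len)

# Lengths taken from the two examples above; the hypothesis has 12 tokens.
r_far = closest_ref_length_sketch([13, 2], 12)   # 13 is closer to 12 than 2 is
r_tie = closest_ref_length_sketch([13, 11], 12)  # tie, so the shorter length 11 wins
bp_far = brevity_penalty_sketch(r_far, 12)       # exp(1 - 13/12)
bp_tie = brevity_penalty_sketch(r_tie, 12)       # 12 > 11, so no penalty
print(r_far, bp_far)
print(r_tie, bp_tie)
```

Sorting on the pair `(distance, length)` is what makes the tie-break fall to the shorter reference.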

Next, let us compute a sentence-level BLEU by hand (a single source text).

>>> from nltk.translate.bleu_score import sentence_bleu
>>> hypothesis = ['the', 'cat', 'is', 'on', 'the', 'mat']

>>> reference1 = ['the', 'cat', 'is', 'on', 'mat']
>>> reference2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
>>> reference3 = ['a', 'cat', 'being', 'on', 'the', 'mat']
>>> references = [reference1, reference2, reference3]

>>> sentence_bleu(references, hypothesis)
0.6756000774035172

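p1 = 5/6, p2 = 5/5, p3 = 3/4, p4 = 1/3, BP = 1 (c = 6 and the closest reference length is r = 6, so $e^{1 - 6/6} = 1$).

This answers the earlier question. For unigrams, "the" occurs twice in the hypothesis, so it counts twice in the denominator. No single reference contains "the" more than once, so it counts once in the numerator. If some reference also contained "the" twice, the numerator would count it twice as well.

Substituting into $\text{BLEU} = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$

gives BLEU = 0.6756000774035172, which matches the program output.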
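The substitution can be checked with a few lines of standard-library Python. This is only a sketch of the arithmetic, using the precisions and BP read off the worked example; it does not call NLTK.

```python
import math
from fractions import Fraction

# Modified precisions p1..p4 from the worked example.
precisions = [Fraction(5, 6), Fraction(5, 5), Fraction(3, 4), Fraction(1, 3)]
weights = [0.25] * 4  # w_n = 1/N with N = 4
bp = 1.0              # c = 6 and the closest reference length is r = 6

# BLEU = BP * exp(sum_n w_n * log p_n), i.e. BP times the geometric mean of the p_n.
log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
bleu = bp * math.exp(log_sum)
print(bleu)
```

With uniform weights the exponential of the weighted log sum is the geometric mean, here $(5/6 \cdot 1 \cdot 3/4 \cdot 1/3)^{1/4}$.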

Corpus-level BLEU (multiple source texts): source code analysis

Keyword: micro-average

nltk.translate.bleu_score.corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None, auto_reweigh=False)

list_of_references (list(list(list(str)))) – a corpus of lists of reference sentences
hypotheses (list(list(str))) – a list of hypothesis sentences

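The i-th element of list_of_references holds the reference translations of the i-th source text; the i-th element of hypotheses is the hypothesis for the i-th source text.

corpus_bleu() uses micro-averaging, so its result differs from the mean of several sentence_bleu() scores. In the source of corpus_bleu(), $p_i$ (the precision for i-grams) is the ratio of p_numerators[i] to p_denominators[i].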

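BP is pooled in the same way: the hypothesis lengths and the closest reference lengths are each summed over all segments, and the penalty is computed once from the two totals. It is not the mean of the per-sentence BP values.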
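A small sketch of what pooling means for BP, under the assumption that lengths are summed over segments before the penalty is applied once. The segment lengths are hypothetical numbers chosen for illustration, and `brevity_penalty_sketch` is a stand-in written for this example, not NLTK's function.

```python
import math

def brevity_penalty_sketch(ref_len, hyp_len):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if hyp_len > ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0  # assumption of this sketch
    return math.exp(1 - ref_len / hyp_len)

# Hypothetical (hypothesis length, closest reference length) for three segments.
segment_lengths = [(6, 6), (4, 8), (20, 19)]

# Pooled: sum the lengths first, then apply the penalty once.
hyp_total = sum(h for h, _ in segment_lengths)  # 30
ref_total = sum(r for _, r in segment_lengths)  # 33
corpus_bp = brevity_penalty_sketch(ref_total, hyp_total)

# For contrast: the plain mean of per-segment penalties.
mean_bp = sum(brevity_penalty_sketch(r, h) for h, r in segment_lengths) / len(segment_lengths)
print(corpus_bp, mean_bp)
```

The two values differ because one very short hypothesis drags the mean down sharply, while in the pooled version it only contributes a few tokens to the totals.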

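p_numerators[i] and p_denominators[i] are both accumulators. For each source text, with its reference translations and its single hypothesis, modified_precision(references, hypothesis, i) computes a precision and returns it as a fraction object with a numerator and a denominator. The numerator is added to p_numerators[i] and the denominator to p_denominators[i].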
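The accumulation can be sketched as follows. The per-segment pairs are hypothetical numbers standing in for what modified_precision would return for one n-gram order; plain tuples are used so that a pair such as (5, 5) is not reduced to 1/1 before it is added.

```python
# Hypothetical (numerator, denominator) pairs for three segments, one n-gram order.
segment_stats = [(5, 6), (1, 4), (9, 10)]

p_numerator = 0
p_denominator = 0
for num, den in segment_stats:
    p_numerator += num    # clipped matches, pooled over segments
    p_denominator += den  # hypothesis n-gram totals, pooled over segments

micro = p_numerator / p_denominator  # 15 / 20
macro = sum(n / d for n, d in segment_stats) / len(segment_stats)  # mean of the ratios
print(micro, macro)
```

Micro-averaging weights each segment by its number of n-grams, so long hypotheses influence the corpus score more than short ones; averaging the ratios would treat every segment equally.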

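So how does modified_precision(references, hypothesis, i) compute its result?

First, counts = Counter(ngrams(hypothesis, i)) counts each distinct i-gram in the hypothesis. The sum of all values in counts is the returned denominator (floored at 1 to avoid division by zero).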

# How are repeated i-grams counted? An example with i = 2.

>>> from collections import Counter
>>> from nltk.util import ngrams
>>> x = ngrams(['I', 'like', 'I', 'like'], 2)
>>> Counter(x)
Counter({('I', 'like'): 2, ('like', 'I'): 1})

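Next, the i-grams of each reference translation are counted, and the resulting Counters are merged by taking the maximum per key. This gives max_counts.

Finally, each value in counts is clipped to the corresponding value in max_counts, which amounts to a multiset intersection: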

clipped_counts = {
    ngram: min(count, max_counts[ngram]) for ngram, count in counts.items()
}

The sum of all clipped counts is the returned numerator.
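Putting the three steps together, here is a self-contained sketch of the clipped precision in plain Python. It follows the steps described above but is not NLTK's code: `ngrams_sketch` and `modified_precision_sketch` are names invented for this example, and the result is returned as a plain (numerator, denominator) tuple.

```python
from collections import Counter

def ngrams_sketch(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision_sketch(references, hypothesis, n):
    """Return (numerator, denominator) of the clipped n-gram precision."""
    counts = Counter(ngrams_sketch(hypothesis, n))
    # Per n-gram, the largest count seen in any single reference.
    max_counts = Counter()
    for reference in references:
        ref_counts = Counter(ngrams_sketch(reference, n))
        for ngram in counts:
            max_counts[ngram] = max(max_counts[ngram], ref_counts[ngram])
    # Clip each hypothesis count to that maximum.
    clipped = {ngram: min(count, max_counts[ngram]) for ngram, count in counts.items()}
    # The denominator is floored at 1 so very short hypotheses do not divide by zero.
    return sum(clipped.values()), max(1, sum(counts.values()))

hypothesis = ['the', 'cat', 'is', 'on', 'the', 'mat']
references = [
    ['the', 'cat', 'is', 'on', 'mat'],
    ['there', 'is', 'a', 'cat', 'on', 'the', 'mat'],
    ['a', 'cat', 'being', 'on', 'the', 'mat'],
]
results = [modified_precision_sketch(references, hypothesis, n) for n in range(1, 5)]
print(results)  # [(5, 6), (5, 5), (3, 4), (1, 3)]
```

For n = 1 the hypothesis count of "the" is 2 but the per-reference maximum is 1, so the clipped count is 1 and the unigram precision is 5/6 rather than 6/6.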

Smooth

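Other notes:

BLEU scores range from 0 to 1; a higher score means the hypothesis is closer to the references.

BP ranges from 0 to 1.

This post is based on the official NLTK documentation: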

https://www.nltk.org/api/nltk.translate.bleu_score.html#nltk.translate.bleu_score.corpus_bleu