Reputation: 43
I am working on an OCR task and, for evaluation purposes, want to calculate a confusion matrix for my model. It should show how often each character is predicted correctly and how often it is predicted as another character (and which ones!).
My current problem is that a simple pairwise comparison is difficult because of string-length mismatches and additional or missing characters (mainly whitespace). I was thinking about using the Levenshtein distance algorithm to record how often a character would need to be inserted or deleted, but I'm still unsure how to handle that.
Are there any state-of-the-art approaches that are commonly used for this? I did some research, but couldn't find anything significant.
Upvotes: 0
Views: 121
Reputation: 548
You are looking for the Needleman-Wunsch algorithm, which finds an optimal global alignment between two sequences. Although it was originally designed for aligning biological sequences, it is also effective for aligning noisy OCR output against ground truth.
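To make the idea concrete, here is a minimal pure-Python sketch of Needleman-Wunsch. It is only an illustration, not the implementation any particular library uses: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and you would likely tune them for real OCR data.

```python
def needleman_wunsch(a, b, gap_char="@", match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; returns (aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not canonical defaults.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_char)
            i -= 1
        else:
            out_a.append(gap_char)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


aligned_a, aligned_b = needleman_wunsch("kitten", "kiten")
print(aligned_a)
print(aligned_b)
```

The two returned strings always have the same length, so they can be compared position by position. The table costs O(len(a) * len(b)) time and memory, which is fine for single lines of text but slow for whole pages.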
A recommended implementation is available in Microsoft's genalog library. You can find usage examples in their documentation. Here's a quick example:
from genalog.text import alignment
from genalog.text import anchor
gt_txt = "New York is big"
noise_txt = "New Yo rkis "
# Align using the anchor method
aligned_gt, aligned_noise = anchor.align_w_anchor(gt_txt, noise_txt, gap_char="@")
print(f"Aligned ground truth: {aligned_gt}")
print(f"Aligned noise: {aligned_noise}")
# Align using the basic alignment method
aligned_gt, aligned_noise = alignment.align(gt_txt, noise_txt, gap_char="@")
print(f"Aligned ground truth: {aligned_gt}")
print(f"Aligned noise: {aligned_noise}")
For this input, the output looks like this, where @ marks a gap (a character missing from that string):
Aligned ground truth: New Yo@rk is big
Aligned noise: New Yo rk@is @@@
After alignment, compare character by character to compute a confusion matrix. For more details, visit the genalog documentation on text alignment.
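As a rough sketch of that last step, the aligned pair shown above can be turned into confusion counts with the standard library alone. The helper name `confusion_counts` is hypothetical, and the code assumes the gap character never occurs in the real text; pick a different `gap_char` if it does.

```python
from collections import Counter, defaultdict


def confusion_counts(aligned_gt, aligned_pred):
    """Count (ground-truth char -> predicted char) pairs of two aligned strings."""
    if len(aligned_gt) != len(aligned_pred):
        raise ValueError("aligned strings must have equal length")
    counts = defaultdict(Counter)
    for gt_char, pred_char in zip(aligned_gt, aligned_pred):
        counts[gt_char][pred_char] += 1
    return counts


# Aligned strings taken from the example output above (gap_char="@")
aligned_gt = "New Yo@rk is big"
aligned_noise = "New Yo rk@is @@@"

matrix = confusion_counts(aligned_gt, aligned_noise)
for gt_char, row in sorted(matrix.items()):
    print(repr(gt_char), dict(row))
```

A row keyed by the gap character collects insertions (the OCR produced a character that is not in the ground truth), and a gap in a column collects deletions (the OCR dropped a character). Summing the counts over all aligned line pairs gives the full matrix.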
Upvotes: 0