Why do the EMNIST ByMerge and Balanced datasets have exactly 47 classes each?

Question

I am using EMNIST as the dataset for a text detection and recognition project using deep learning. I downloaded the datasets from https://pypi.org/project/emnist/ (using pip install emnist). The datasets come from https://www.nist.gov/itl/products-and-services/emnist-dataset, which describes them as follows:

EMNIST ByClass: 814,255 characters. 62 unbalanced classes.

EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.

EMNIST Balanced: 131,600 characters. 47 balanced classes.

EMNIST Letters: 145,600 characters. 26 balanced classes.

EMNIST Digits: 280,000 characters. 10 balanced classes.

EMNIST MNIST: 70,000 characters. 10 balanced classes.

Most of these make sense. For example, the 62 classes are made up of 10 digits, 26 uppercase letters and 26 lowercase letters. But ByMerge and Balanced each have 47.

I have looked into the data myself and found 10 digits, 26 letters (a mixture of uppercase and lowercase) and, as far as I can tell, the remaining 11 are seemingly random lowercase letters ('a','b','d','e','f','g','h','n','q','r','t').

Does anyone know why these extra 11 have been specifically included?

Daniel B · Accepted Answer

I have since found the answer to this question in the paper EMNIST: an extension of MNIST to handwritten letters by G. Cohen et al. (available here: https://arxiv.org/pdf/1702.05373v1.pdf).

The paper explains that for many letters the uppercase and lowercase forms look very similar when handwritten, which makes them hard to tell apart during classification. To counteract this, the dataset merges the uppercase and lowercase classes of the letters where this was judged to be a problem.

From the paper:

The merged classes, as suggested by the NIST, are for the letters C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z.

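This accounts for the missing classes: merging the upper- and lowercase forms of these 15 letters removes 15 of the 62 classes, leaving 47. The 11 lowercase letters listed in the question are simply the ones that were not merged. (I would still have liked to see a balanced 62-class option, or a 36-class option with all the letters merged.)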
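
As a quick sanity check on the count, here is a minimal Python sketch that rebuilds the class list from the merged letters quoted above. The names MERGED and merged_class_labels are my own, and the ordering (digits, then uppercase, then the remaining lowercase letters) is only an assumption for illustration; it is not read from the dataset's own label mapping.

```python
import string

# Letters whose uppercase and lowercase forms share one class,
# copied from the list quoted from the paper.
MERGED = "CIJKLMOPSUVWXYZ"


def merged_class_labels(merged=MERGED):
    """Return one label per class after merging the given letters.

    The ordering here is an assumption made for illustration only.
    """
    digits = list(string.digits)
    uppercase = list(string.ascii_uppercase)
    # A lowercase letter keeps its own class only if it was not merged.
    lowercase = [c for c in string.ascii_lowercase if c.upper() not in merged]
    return digits + uppercase + lowercase


labels = merged_class_labels()
unmerged_lowercase = [c for c in labels if c.islower()]

print(len(labels))                  # 47
print("".join(unmerged_lowercase))  # abdefghnqrt
```

The leftover lowercase letters match the 11 I found in the data, so the 47 classes are 10 digits + 26 uppercase (15 of which also cover their lowercase form) + 11 unmerged lowercase.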
