Reputation: 161
I am using EMNIST as a dataset for text detection and recognition using deep learning. I downloaded the datasets through the package at https://pypi.org/project/emnist/ (using pip install emnist). The datasets come from https://www.nist.gov/itl/products-and-services/emnist-dataset, which describes them as follows:
EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
EMNIST Balanced: 131,600 characters. 47 balanced classes.
EMNIST Letters: 145,600 characters. 26 balanced classes.
EMNIST Digits: 280,000 characters. 10 balanced classes.
EMNIST MNIST: 70,000 characters. 10 balanced classes.
Most of these make sense: for example, the 62 classes are made up of 10 digits, 26 uppercase letters and 26 lowercase letters. But ByMerge and Balanced have 47.
I have looked into the data myself and found 10 digits, 26 letters (a mixture of uppercase and lowercase) and then, as far as I can tell, the remaining 11 are seemingly random lowercase letters ('a','b','d','e','f','g','h','n','q','r','t').
Does anyone know why these extra 11 have been specifically included?
Upvotes: 5
Views: 3513
Reputation: 35
I'm not sure if this is the correct answer, but here's my guess. Characters such as "C" or "S" have very similar-looking uppercase and lowercase forms. Even for a human, a single "C" or "S" seen by itself can be hard to identify as uppercase or lowercase. This is why I believe the creators of the ByMerge split of EMNIST decided not to keep a separate lowercase class for letters like that, and to keep one only for letters like "A" or "R", which look very different from their lowercase counterparts.
For reference:
A, B, C, D, E, F, G, H, I, J, K
a, b, c, d, e, f, g, h, i, j, k
Some of these letters look very similar (e.g. C and K), whereas others don't (e.g. b and g).
Upvotes: 3
Reputation: 161
I have since found an answer to this question by looking into the paper "EMNIST: an extension of MNIST to handwritten letters" by G. Cohen et al. (available here: https://arxiv.org/pdf/1702.05373v1.pdf).
The paper explains that for many letters the uppercase and lowercase variants are very similar, which makes them hard to tell apart in character recognition. To counteract this, the authors merged the uppercase and lowercase classes of the letters they considered affected.
From the paper:
The merged classes, as suggested by the NIST, are for the letters C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z.
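This accounts for the missing classes: 15 of the 26 letters are merged, which leaves 11 letters (A, B, D, E, F, G, H, N, Q, R and T) with a separate lowercase class, so 10 + 26 + 11 = 47 (although I would have liked to see a 62 balanced class option or a 36 class option with all the letters merged).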
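As a quick sanity check on the arithmetic, here is a minimal sketch that rebuilds the 47 classes from the merged-letter list quoted from the paper. The names `MERGED`, `unmerged_lower` and `classes` are my own, and the order of the resulting list is only illustrative; I am not assuming it matches the label indices used in the dataset files.

```python
import string

# Letters whose upper/lowercase classes are merged, as quoted from the paper.
MERGED = set("CIJKLMOPSUVWXYZ")

digits = list(string.digits)            # 10 digit classes
upper = list(string.ascii_uppercase)    # 26 letter classes (merged or uppercase-only)

# Letters NOT in the merged list keep a separate lowercase class.
unmerged_lower = [c.lower() for c in upper if c not in MERGED]

# Illustrative ordering only; the dataset's own label mapping may differ.
classes = digits + upper + unmerged_lower

print(len(MERGED), len(unmerged_lower), len(classes))
print(unmerged_lower)
```

The lowercase letters it prints are the same 11 listed in the question, so the 47 classes are 10 digits, 26 letter classes (15 of them covering both cases) and 11 extra lowercase classes.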
Upvotes: 5