Reputation: 723
I have an image dataset (similar to ImageNet) in which each image comes with a "description with typos". I would like to train a deep convolutional neural network on it, but I need to generate the "labels" first. So, here's the question:
How can I generate category "labels" from "descriptions with typos"?
The dataset has around 13M images, each with a corresponding (valid) "description" that may contain typos. Some examples of "descriptions" follow below:
I was thinking of approaching the problem in the following way.
Upvotes: 0
Views: 141
Reputation: 36
Here are some ideas:
You should definitely run a spell checker first, otherwise your labels will be even noisier. Options:
Take a look at an information retrieval course and implement the spell checking yourself: search for lecture3-tolerant-retrieval-handout-6-per.pdf (though I bet this is not the way to go). If you need word frequencies, search for "Natural Language Corpus Data".
Use existing code, e.g. http://norvig.com/spell-correct.html (implementations are available in many languages).
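To make the spell-checking option more concrete, here is a minimal sketch of a frequency-based corrector that only considers candidates one edit away. It is a simplified variant of that idea, not the code from the linked page. It assumes the word frequencies are counted from the dataset's own descriptions (you could use an external corpus instead), and the function names and the `min_count` threshold are made up for illustration.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_counts(descriptions):
    """Count how often each word occurs across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(words(description))
    return counts


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts, min_count=2):
    """Return the most frequent known word within one edit of `word`.

    A word seen at least `min_count` times is assumed to be spelled
    correctly (an assumption; tune the threshold for your data).
    """
    if counts[word] >= min_count:
        return word
    candidates = [w for w in edits1(word) if counts[w] >= min_count]
    if not candidates:
        return word  # nothing better found, keep the original token
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (counts[w], w))


def correct_description(text, counts):
    """Spell-correct every token of a description."""
    return " ".join(correct(w, counts) for w in words(text))


if __name__ == "__main__":
    # Toy descriptions, invented purely for this example.
    sample = [
        "red leather shoes",
        "red leather boots",
        "blue leather shoes",
        "red lether shoes",
    ]
    counts = build_counts(sample)
    print(correct_description("Red lether shoes", counts))
```

With 13M descriptions, rare but valid words will look like typos to a corrector of this kind, so it is probably worth inspecting a sample of the corrections before using them to build labels.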
Regarding labeling (I guess you want to do it automatically; otherwise there are semi-automatic methods):
I hope this is useful.
Upvotes: 1