From "descriptions with typos" to "labels"

Background

I have an image dataset (similar to ImageNet) that comes with a "description with typos" for each image. I would like to run a deep convolutional neural network on it, but I need to generate the "labels" first. So, here's the question:

Question

How can I generate category "labels" from "descriptions with typos"?

Technical information

The dataset has around 13M images, each with a corresponding (valid) "description" and optional "typos". Some examples of "descriptions" follow:

First example Second example

Ideas

I was thinking of approaching the problem in the following way.

Fix typos:
- run a spell check to identify spelling errors;
- find the best replacement for each misspelled word, by
  - looking at other descriptions in the dataset, or
  - checking the image and correcting the typo manually.
Generate the final labels:
- run a clustering algorithm (k-means, for example) on a sentence embedding (a function that maps sentences into ℝᴺ), or
- use the most frequent words.
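
The "most frequent words" option could be sketched roughly as follows. This is only an illustration, not a tested pipeline for 13M descriptions: the stop-word list, the helper names (`tokenize`, `build_label_vocabulary`, `assign_label`) and the toy descriptions are all made up for the example, and it assumes the typos have already been fixed.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and"}


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_label_vocabulary(descriptions, num_labels):
    """Return the num_labels most frequent non-stop-words over all descriptions."""
    counts = Counter(
        word
        for description in descriptions
        for word in set(tokenize(description))  # count each word once per description
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(num_labels)]


def assign_label(description, label_vocabulary):
    """Pick the highest-ranked vocabulary word found in the description, or None."""
    words = set(tokenize(description))
    for label in label_vocabulary:  # vocabulary is ordered by frequency
        if label in words:
            return label
    return None


# Toy data, invented for the example.
descriptions = [
    "a red car on the street",
    "red car parked",
    "a small dog in the park",
    "dog with a ball",
    "car and dog",
]
vocabulary = build_label_vocabulary(descriptions, num_labels=2)
labels = [assign_label(d, vocabulary) for d in descriptions]
```

Descriptions that contain none of the vocabulary words get `None`, so they would need a fallback (a larger vocabulary, or the clustering option instead).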

igarciad · Accepted Answer

Here are some ideas:

You should definitely run a spell check first; otherwise your labels will be even noisier. Options:
- Work through an information retrieval course and implement the checker yourself; search for lecture3-tolerant-retrieval-handout-6-per.pdf (I bet this is not the way to go). If you need word frequencies, search for "Natural Language Corpus Data".
- Use existing code: http://norvig.com/spell-correct.html (available in many languages).
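
To give a feel for the second option, here is a simplified sketch in the spirit of that kind of corrector. It is not the code from the linked page: it only considers candidates one edit away, and it assumes the word counts come from the dataset's own descriptions (a toy string here) rather than from a large external corpus. The names `word_counts`, `edits1` and `correct` are chosen for this example.

```python
import re
from collections import Counter


def words(text):
    """Lowercase text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, word_counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:
        return word  # nothing plausible found; leave the word untouched
    return max(candidates, key=word_counts.get)


# Toy corpus standing in for the (assumed mostly correct) descriptions.
word_counts = Counter(words("red car red car blue car a dog a dog a cat"))

fixed = [correct(w, word_counts) for w in words("redd cra and a dgo")]
```

Words with no known neighbour (such as "and" in the toy input) are returned unchanged, which is also where manual correction by looking at the image could come in.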
Regarding labeling (I guess you want it done automatically; otherwise there are semi-automatic methods):

I hope this is useful.
