Atcold

Reputation: 723

From "descriptions with typos" to "labels"

Background

I have an image dataset (similar to ImageNet) in which each image comes with a "description with typos". I would like to run a deep convolutional neural network on it, but I need to generate the "labels" first. So, here is the question:

Question

How can I generate category "labels" from "descriptions with typos"?

Technical information

The dataset has around 13M images, each with a corresponding (valid) "description" and optional "typos". Some examples of "descriptions" follow:

[Two example images: "First example", "Second example"]

Ideas

I was thinking of approaching the problem in the following way.

  1. Fix typos:
    • run a spell checker to identify spelling errors;
    • find the best replacement word, by
      • looking at other descriptions in the dataset, or
      • checking the image and correcting the typo manually.
  2. Generate the final labels:
    • run a clustering algorithm (k-means, for example) on a sentence embedding (a function that maps sentences into ℝᴺ), or
    • use the most frequent words.
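
The "most frequent words" option in step 2 could be sketched roughly as follows. This is only a minimal illustration, assuming the typos have already been fixed; the stop-word list, the helper names (`tokenize`, `build_word_counts`, `label_for`) and the toy descriptions are all made up for the example and are not taken from the dataset.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "at"}

def tokenize(description):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", description.lower())

def build_word_counts(descriptions):
    """Count in how many descriptions each non-stop word appears."""
    counts = Counter()
    for description in descriptions:
        counts.update(set(tokenize(description)) - STOP_WORDS)
    return counts

def label_for(description, counts):
    """Pick the word of this description that is most frequent dataset-wide."""
    candidates = [w for w in tokenize(description) if w not in STOP_WORDS]
    if not candidates:
        return None
    # max() keeps the first maximal element, so ties go to the earliest word.
    return max(candidates, key=lambda w: counts[w])

# Toy descriptions, invented for illustration only.
descriptions = [
    "a red apple on a table",
    "green apple with leaves",
    "a wooden table",
    "apple pie",
]
counts = build_word_counts(descriptions)
labels = [label_for(d, counts) for d in descriptions]
print(labels)
```

Note that "apple pie" ends up labelled "apple" here, which hints at the weakness of this heuristic: the most frequent word is not always the one that names the category.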

Upvotes: 0

Views: 141

Answers (1)

igarciad

Reputation: 36

Here are some ideas:

  1. You should definitely run a spell checker; otherwise your labels will be even noisier. Options:

    • Check an information retrieval course and implement the checking yourself (search for lecture3-tolerant-retrieval-handout-6-per.pdf), although I doubt this is the way to go. If you need word frequencies, search for "Natural Language Corpus Data".

    • Use existing code, such as http://norvig.com/spell-correct.html (available in many languages).

  2. Regarding labeling (I guess you want it done automatically; otherwise there are semi-automatic methods):

    • Use http://viget.com/extend/tagging-text-automatically. I have never used it, but it should work reasonably well.
    • I would not recommend using k-means, because you do not know the number of groups in advance.
    • Using the most frequent word might work for a few examples (like the ones you show), but it might not work in many cases.
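
To make the spell-checking option in item 1 more concrete, here is a rough sketch in the spirit of the corrector described at http://norvig.com/spell-correct.html. It is a simplified variant, not that page's code: it only considers candidates one edit away, and it assumes the word frequencies are built from the dataset's own descriptions rather than from a large external corpus. The toy descriptions are invented for illustration.

```python
from collections import Counter
import re

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def words(text):
    """Lowercase text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the word."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (counts[w], w))

# Assumption: frequencies come from the dataset's own (mostly valid) descriptions.
descriptions = [
    "golden retriever puppy",
    "golden retriever on grass",
    "retriever puppy playing",
]
counts = Counter(words(" ".join(descriptions)))

print(correct("retreiver", counts))
print(correct("goldn", counts))
```

One caveat of building the vocabulary from the dataset itself: a typo that occurs in the descriptions counts as a known word and is left unchanged, so a minimum-frequency threshold or an external word list may be needed in practice.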

I hope this is useful.

Upvotes: 1
