Sam

Reputation: 1834

Is there any way to use cognitive services to detect if a string contains words vs just junk shift chars/gibberish?

I'm trying to find a way to use cognitive services to detect if a string contains a piece of coherent text or is just junk. Example:

SDF#%# ASFSDS b

vs

Hi my name is Sam.

This seems impossible to do. I had the idea of running the text through the keywords text analysis (which would give me a keyword of ASFSDS (how useful!)) and then running that keyword through the Bing Spell Check. I'm not sure what is going on in the USA, but it seems ASFSDS is English. It really is quite... erm... dumb.

I've tried running similar text through a bunch of services (like language detection) and they all seem convinced that my gibberish samples are 100% coherent English.

I'm going to quiz an MS rep about it on Friday, but I was wondering if anyone has achieved something like this using Cognitive Services?

Upvotes: 0

Views: 336

Answers (1)

cthrash

Reputation: 2973

Rather than a binary is-word-or-not question, what you might consider instead is the probability of a word being gibberish. You can then choose a threshold that you like.

For computing word probabilities, you might try the Web Language Model API. You could look at the joint probability, for example. For your set of words, the response looks as follows (values for the body corpus):

{
  "results": [
    {
      "words": "sdf#%#",
      "probability": -12.215
    },
    {
      "words": "asfsds",
      "probability": -12.215
    },
    {
      "words": "b",
      "probability": -3.127
    },
    {
      "words": "hi",
      "probability": -3.905
    },
    {
      "words": "my",
      "probability": -2.528
    },
    {
      "words": "name",
      "probability": -3.128
    },
    {
      "words": "is",
      "probability": -2.201
    },
    {
      "words": "sam.",
      "probability": -12.215
    },
    {
      "words": "sam",
      "probability": -4.431
    }
  ]
}

You will notice a couple of idiosyncrasies:

  1. Probabilities are negative. This is because they are logarithmic.
  2. All terms are case-folded. This means that the corpus won't distinguish between, say, GOAT and goat.
  3. The caller must perform a certain amount of normalization themselves (note the probability of sam. vs sam).
  4. Corpora are only available for the en-us market. This could be problematic depending on your use case.
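
To make the threshold idea concrete, here is a minimal Python sketch that works offline on the sample values above rather than calling the API. The function names, the -8.0 threshold, and the choice to score unseen tokens at -12.215 are all my own assumptions for illustration; you would tune them against real responses.

```python
# Log-probabilities copied from the sample response above (body corpus).
SAMPLE_LOG_PROBS = {
    "sdf#%#": -12.215,
    "asfsds": -12.215,
    "b": -3.127,
    "hi": -3.905,
    "my": -2.528,
    "name": -3.128,
    "is": -2.201,
    "sam.": -12.215,
    "sam": -4.431,
}

# Assumption: tokens missing from the lookup get the lowest score seen in the sample.
UNKNOWN_LOG_PROB = -12.215


def normalize(token):
    """Case-fold and drop trailing sentence punctuation (hypothetical helper)."""
    return token.lower().rstrip(".,!?;:")


def gibberish_ratio(text, log_probs, threshold=-8.0):
    """Return the fraction of tokens whose log-probability falls below threshold.

    The threshold of -8.0 is an arbitrary illustrative value, not a recommendation.
    """
    tokens = text.split()
    if not tokens:
        # Assumption: an empty string counts as "not coherent text".
        return 1.0
    unlikely = 0
    for raw in tokens:
        score = log_probs.get(normalize(raw), UNKNOWN_LOG_PROB)
        if score < threshold:
            unlikely += 1
    return unlikely / len(tokens)


if __name__ == "__main__":
    for sample in ("SDF#%# ASFSDS b", "Hi my name is Sam."):
        print(sample, "->", gibberish_ratio(sample, SAMPLE_LOG_PROBS))
```

With the sample values, two of the three tokens in the junk string fall below the threshold, while none of the tokens in the sentence do once "Sam." is normalized to "sam". Note that the single letter "b" scores as a likely word, so very short strings will probably need extra handling.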

An advanced use case would be computing conditional probabilities, i.e. the probability of a word in the context of words preceding it.
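
For reference, a conditional log-probability relates to joint log-probabilities through the standard identity below. This is general probability rather than anything taken from the API documentation, and the notation (w_1 through w_n for the words in order) is my own:

```latex
\log P(w_n \mid w_1, \dots, w_{n-1})
  = \log P(w_1, \dots, w_n) - \log P(w_1, \dots, w_{n-1})
```

In other words, a word that looks plausible on its own can still score poorly if it is unlikely to follow the words before it, which may help with strings made of real words in a nonsensical order.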

Upvotes: 1
