David M

Reputation: 2943

Determine if a body of text contains valid words or just "gibberish"

I'm interested in ideas for identifying whether a given body of text contains valid, actual words or is just gibberish.

The problem I run into immediately is that it needs to be language-agnostic, as the data we deal with is highly international. This means either a statistical approach, or an extremely large, multi-lingual hash table approach.

The multi-lingual hash tables seem straightforward, but unwieldy and possibly quite slow. (Or at the very least, a compromise between speed and accuracy.)

However, I don't really have a background in the statistical approaches that would be useful here, and I would very much appreciate anyone's experience, input, or other suggestions.

Upvotes: 3

Views: 2470

Answers (2)

kevinc

Reputation: 636

Do you know or can you determine the language of the document? I don't think loading a dictionary for a single language and calculating the % of valid words would be inordinately slow or memory intensive.

How accurate does it need to be?
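A minimal sketch of the dictionary idea in Python, assuming a plain one-word-per-line word list; the file path, the tokenization, and whatever cutoff you apply to the resulting ratio are all placeholders rather than fixed choices:

    import re

    def load_wordlist(path):
        """Load one word per line into a set for fast membership tests."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def valid_word_ratio(text, wordlist):
        """Return the fraction of alphabetic tokens that appear in the word list."""
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in wordlist) / len(tokens)

    # /usr/share/dict/words is just an example word list location.
    words = load_wordlist("/usr/share/dict/words")
    print(valid_word_ratio("the quick brown fox", words))   # close to 1.0 for real text
    print(valid_word_ratio("xkqzv wprnt gkjjh", words))     # much lower for gibberish

You would then flag a document as gibberish when the ratio falls below some threshold you tune for your data.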

Upvotes: 1

Jeff Foster

Reputation: 44696

You could use n-gram analysis to compare your text with an example text, operating on either characters or words.

Google's Ngram Viewer can help visualize what I mean. For example, searching for "haddock refrigerator" returns no occurrences (i.e. it's gibberish), whereas "stack overflow" shows the phrase coming into prominence once computers did.
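A rough character-bigram sketch of the idea, assuming you have a sizable reference text in the expected language; the file name is a stand-in, and a real implementation would likely use longer n-grams and better smoothing:

    import math
    from collections import defaultdict

    def train_bigrams(reference_text):
        """Build add-one-smoothed log-probabilities for character transitions."""
        chars = [c.lower() for c in reference_text if c.isalpha() or c == " "]
        counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
        alphabet = set(chars)
        model = {}
        for a in alphabet:
            total = sum(counts[a].values()) + len(alphabet)
            model[a] = {b: math.log((counts[a][b] + 1) / total) for b in alphabet}
        return model, alphabet

    def avg_log_prob(text, model, alphabet):
        """Average log-probability of the text's character transitions."""
        chars = [c.lower() for c in text if c.lower() in alphabet]
        pairs = list(zip(chars, chars[1:]))
        if not pairs:
            return float("-inf")
        return sum(model[a][b] for a, b in pairs) / len(pairs)

    # "english_sample.txt" stands in for any large text in the expected language.
    with open("english_sample.txt", encoding="utf-8") as f:
        model, alphabet = train_bigrams(f.read())

    print(avg_log_prob("this looks like ordinary English text", model, alphabet))
    print(avg_log_prob("xqzkv wprnt gkjjh vvbnm", model, alphabet))  # noticeably lower

Real text tends to score close to the reference corpus, while gibberish produces improbable character transitions and a much lower average, so a simple score cutoff can separate the two.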

Upvotes: 2
