Reputation: 2943
I'm interested in ideas for identifying whether any given body of text contains valid, actual words, or just gibberish text.
The problem I run into immediately is that it needs to be language-agnostic, as the data we deal with is highly international. This means either a statistical approach or an extremely large, multi-lingual hash table approach.
The multi-lingual hash tables seem straightforward, but unwieldy and possibly quite slow. (Or at the very least, a compromise between speed and accuracy.)
However, I don't really have a background in the statistical approaches that would be useful in this situation, and I would very much appreciate anyone's experience, input, or other suggestions.
Upvotes: 3
Views: 2470
Reputation: 636
Do you know or can you determine the language of the document? I don't think loading a dictionary for a single language and calculating the % of valid words would be inordinately slow or memory intensive.
How accurate does it need to be?
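To make the dictionary idea concrete, here is a minimal sketch in Python. It assumes a plain-text word list for the detected language (one word per line, e.g. a file like /usr/share/dict/words); the file path, tokenization, and threshold are placeholders you would adapt, not part of the original answer.

```python
import re

def load_dictionary(path):
    # Load a one-word-per-line word list into a set for fast lookups.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def valid_word_ratio(text, dictionary):
    # Naive word tokenization; real input may need smarter handling
    # of punctuation, numbers, and scripts without word boundaries.
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)

# Example usage: flag a document as gibberish below some tuned threshold.
# dictionary = load_dictionary("/usr/share/dict/words")
# if valid_word_ratio(document_text, dictionary) < 0.5:
#     print("likely gibberish")
```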
Upvotes: 1
Reputation: 44696
You could use n-gram analysis to compare your text with an example text. This could be done on either characters or words.
Google's Ngram Viewer can help visualize what I mean. As an example, if I search for "haddock refrigerator" there are no occurrences (i.e. it's gibberish), whereas "stack overflow" shows occurrences rising to prominence once computers did.
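As a rough sketch of the character-n-gram idea, the snippet below scores a string by the average log-probability of its character bigrams under a model trained on a reference corpus; gibberish tends to score noticeably lower than real text. The reference file, the bigram choice, and the smoothing floor are assumptions for illustration, not anything prescribed by the answer.

```python
from collections import Counter
import math

def char_ngrams(text, n=2):
    # Normalize whitespace and slice the string into overlapping n-grams.
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_ngram_model(corpus, n=2):
    # Count n-gram occurrences in a reference corpus of "real" text.
    counts = Counter(char_ngrams(corpus, n))
    return counts, sum(counts.values())

def avg_log_likelihood(text, counts, total, n=2, floor=1e-8):
    # Average log-probability per n-gram; unseen n-grams get a small floor.
    grams = char_ngrams(text, n)
    if not grams:
        return float("-inf")
    score = sum(math.log(max(counts.get(g, 0) / total, floor)) for g in grams)
    return score / len(grams)

# Example usage with a hypothetical reference file:
# counts, total = train_ngram_model(open("reference.txt", encoding="utf-8").read())
# print(avg_log_likelihood("stack overflow", counts, total))
# print(avg_log_likelihood("xqzv jrkl wmpf", counts, total))
```

Because the model works on characters rather than dictionary entries, it stays reasonably language-agnostic as long as the reference corpus matches the script of the text being checked.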
Upvotes: 2