Craig Zheng

Reputation: 453

How to detect whether text is human-readable?

I am wondering whether there is a way to tell if a given text is human-readable. By human-readable, I mean that it carries some meaning and is formatted like an article written by a person, or at least generated by translation software and intended to be read by a human.

Here's the background: I recently built an app that lets users upload a short text to a database. In the early stage of deployment I noticed that some users always uploaded corrupted text because of an encoding problem. That problem has since been fixed, but it left me wondering whether there is a way to catch non-human-readable text before serving it back to users.

Any advice is appreciated. Covering other languages might make the scope too large, so for now let's limit the discussion to English.

Upvotes: 3

Views: 1171

Answers (3)

Yisheng Jiang

Reputation: 140

Do a hexdump and make sure each byte is less than or equal to 0x7f, i.e. within the 7-bit ASCII range.
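The same check can be done in code instead of reading a hexdump by eye. This is a minimal sketch, assuming the uploaded text is available as raw bytes; the function names `is_ascii_bytes` and `non_ascii_offsets` are made up for illustration.

```python
def is_ascii_bytes(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (<= 0x7F)."""
    return all(b <= 0x7F for b in data)


def non_ascii_offsets(data: bytes) -> list:
    """Return the offsets of bytes above 0x7F, to help locate the damage."""
    return [i for i, b in enumerate(data) if b > 0x7F]


if __name__ == "__main__":
    sample = "café".encode("utf-8")  # 'é' is encoded as two bytes in UTF-8
    print(is_ascii_bytes(b"plain english text"))
    print(is_ascii_bytes(sample))
    print(non_ascii_offsets(sample))
```

Note that this only tests for ASCII: valid English text containing curly quotes or accented names would be flagged, and random ASCII gibberish would pass.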

Upvotes: 0

Ulrich

Reputation: 309

Most NLP libraries will do the job (spaCy is a very common one). You can also use language detection: langdetect (https://pypi.org/project/langdetect/) supports this, as do many others. If you need something less language-specific (more math than language), look into phonotactics (BLICK for Python: https://github.com/mmcauliffe/python-BLICK), which examines how the characters in a string are ordered.

Upvotes: 0

Pierre

Reputation: 1246

You can try a language identification tool, or something similar.

Basically, you count the characters, or groups of characters (character n-grams), and compare the distribution in the submitted text with the distribution in a collection of texts written in good English. (Make sure that the collection is representative of the expected input.)
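One way to sketch this comparison is with character bigram counts and cosine similarity. Cosine similarity is just one possible distance measure, chosen here for illustration; `REFERENCE_TEXT` is a tiny stand-in for a real, representative corpus, and any cut-off for accepting or rejecting a text would have to be tuned on real data.

```python
import math
from collections import Counter

# Stand-in reference corpus (an assumption for this sketch); in practice
# this should be a large collection of representative English text.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog and then the dog "
    "runs after the fox through the field of tall grass in the sun"
)


def ngram_profile(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_reference(text: str) -> float:
    """Score a submitted text against the reference n-gram profile."""
    return cosine_similarity(ngram_profile(REFERENCE_TEXT), ngram_profile(text))


if __name__ == "__main__":
    print(similarity_to_reference("the user wrote this text in the app"))
    print(similarity_to_reference("Ã©Ã¨Ã¢â‚¬â„¢Ã¼Ã¶"))  # typical mojibake
```

Text damaged by an encoding problem tends to share few or no n-grams with the English reference, so it scores far lower than ordinary prose.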

As an extension of the n-gram approach, you might try a dictionary-based approach and check for the presence of stop words (e.g. 'the', 'a', 'an', 'of') in the input text.
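A rough sketch of the stop-word check follows. The word list is a small hand-picked sample rather than a complete stop-word list, and `stop_word_ratio` is a hypothetical helper name; a real deployment would pick a threshold by looking at genuine uploads.

```python
import re

# Small illustrative sample; a fuller list would work better in practice.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def stop_word_ratio(text: str) -> float:
    """Return the fraction of word tokens that are common English stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in STOP_WORDS)
    return hits / len(tokens)


if __name__ == "__main__":
    print(stop_word_ratio("The cat sat on the mat"))
    print(stop_word_ratio("xkq zzv qwrt"))
```

Ordinary English prose usually contains a noticeable share of stop words, while corrupted or random text contains almost none, so a ratio near zero on a text of reasonable length is a warning sign.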

Upvotes: 2
