David

Reputation: 39

Does NLTK provide a library to measure the ordinary level of vocabulary?

Does NLTK, or any other NLP tool, provide a library to measure the ordinary level of vocabulary?

By "ordinary level", I mean that certain words are simple and frequently used, such as "and, age, yes, this, those, kind", which any elementary school student should know. Similarly, the Longman English Dictionary (usually used for ESL) defines a 3000-word basic vocabulary in which all of its entries are explained.

At the other end, there could be a set of rarely used words that only pedants use, such as Agastopia, Impignorate, Gobbledygook, etc.

There are surely some levels between these two extremes. Certainly, the definition of these levels is subjective, and I expect that different organizations or people will have different views. At the very least, it could vary from region to region.

My purpose is to measure the difficulty/complexity of some passages, for now naively, just by checking their vocabulary.

"Ordinary level" might not be a good description, but I am not able to find a proper, formal expression :). I hope my explanation clarifies my purpose.

Upvotes: 1

Views: 300

Answers (1)

DBaker

Reputation: 2139

An empirical approach to this problem is to use term frequencies from a large corpus of documents. Using most of English Wikipedia, I have created a dictionary of term frequencies (which can be downloaded here):

import pickle
with open('/home/user/data/enWikipediaDictTermCounts.pickle', 'rb') as handle:
    d = pickle.load(handle)

#common words will have high counts (they appear many times in wikipedia):

d.get('age',0)
#207669
d.get('kind',0)
#62302

#rare words will have low counts:

d.get('agastopia',0)
#1
d.get('gobbledygook',0)
#39
d.get('serendipitous',0)
#186

As a rule of thumb, rare words appear fewer than 500 times in this dictionary and common words appear more than 10K times. You can play with these thresholds to find the right level of rarity (or commonness) for your application.
Remark: all words in the dictionary have been converted to lowercase, so lowercase your own words before looking them up.
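
To make the thresholds concrete, here is a minimal sketch of how they could be turned into a per-word level and a rough passage score. The names word_level, rare_word_ratio and sample_counts are hypothetical, the regex tokenizer is a simplification, and sample_counts is only a tiny stand-in for the real dictionary d (its counts are copied from the examples above). Note that a word missing from the dictionary gets a count of 0, so typos and proper names are treated as rare.

```python
import re

# Thresholds from the rule of thumb above; tune them for your application.
RARE_MAX = 500        # fewer occurrences than this -> "rare"
COMMON_MIN = 10_000   # more occurrences than this -> "common"

def word_level(word, counts):
    """Return 'common', 'rare', or 'intermediate' for a single word."""
    n = counts.get(word.lower(), 0)  # dictionary keys are lowercase
    if n > COMMON_MIN:
        return "common"
    if n < RARE_MAX:
        return "rare"
    return "intermediate"

def rare_word_ratio(text, counts):
    """Fraction of word tokens in `text` that fall into the 'rare' level."""
    # Naive tokenizer (assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    rare = sum(1 for t in tokens if word_level(t, counts) == "rare")
    return rare / len(tokens)

# Stand-in for the Wikipedia dictionary `d` (counts taken from the examples above).
sample_counts = {
    "age": 207669,
    "kind": 62302,
    "agastopia": 1,
    "gobbledygook": 39,
    "serendipitous": 186,
}

print(word_level("Age", sample_counts))
# common
print(rare_word_ratio("age kind gobbledygook agastopia", sample_counts))
# 0.5
```

With the real dictionary you would pass d instead of sample_counts. A higher ratio suggests a harder passage, but this is only a vocabulary-based proxy and ignores sentence structure entirely.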

Upvotes: 2
