Reputation: 671
I'm working with the Stack Exchange data dump and attempting to identify unique and novel words in the corpus. I'm doing this by checking the corpus against a very large reference word list and extracting the words that are not present in it.
The problem I am running up against is that a number of the unique tokens are non-words, like directory names, error codes, and other strings.
Is there a good method of differentiating word-like strings from non-word-like strings?
I'm using NLTK, but am not limited to that toolkit.
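For context, the extraction step is essentially a set difference. A minimal sketch (the tokens and reference list below are made up for illustration; in practice the reference could be something like NLTK's `words` corpus):

```python
def novel_tokens(corpus_tokens, reference_words):
    """Return corpus tokens that are absent from the reference word list.

    reference_words: any iterable of known words, e.g. the entries of
    NLTK's words corpus (illustrative choice, not required).
    """
    reference = {w.lower() for w in reference_words}
    return sorted({t for t in corpus_tokens if t.lower() not in reference})

# Toy example: the non-words "/var/log/syslog" and "0x1F" are flagged
# right alongside the genuinely novel token "blargify".
corpus = ["word", "list", "blargify", "/var/log/syslog", "0x1F"]
print(novel_tokens(corpus, ["word", "list"]))
# → ['/var/log/syslog', '0x1F', 'blargify']
```

This illustrates the problem: the set difference cannot tell a novel word from a path or a hex code.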
Upvotes: 1
Views: 207
Reputation: 51
This is an interesting problem because it's so difficult to define what makes a combination of characters a word. I would suggest using supervised machine learning. First, take the current output from your program and manually annotate each example as a word or a non-word. Then, come up with some features that might separate the two classes.
Then, use a library like scikit-learn to train a model that captures these differences and can predict the likelihood of "wordness" for any sequence of characters.
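A minimal sketch of that pipeline, assuming a tiny hand-labelled set and a few simple surface features (the feature choices and the use of `LogisticRegression` are illustrative, not prescriptive):

```python
import string
from sklearn.linear_model import LogisticRegression

def features(token):
    """Map a token to a numeric feature vector (illustrative features)."""
    alpha = sum(c.isalpha() for c in token)
    vowels = sum(c in "aeiou" for c in token.lower())
    return [
        len(token),
        alpha / len(token),                           # fraction of letters
        vowels / max(alpha, 1),                       # vowel ratio among letters
        sum(c.isdigit() for c in token),              # digit count
        sum(c in string.punctuation for c in token),  # '/', '_', '.', ...
    ]

# Tiny hand-labelled training set: 1 = word-like, 0 = non-word-like.
train = ["running", "defenestrate", "cat", "/usr/local/bin", "0x7F", "ERR_404"]
labels = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit([features(t) for t in train], labels)
for tok in ["blargify", "/var/log/syslog"]:
    print(tok, clf.predict_proba([features(tok)])[0][1])  # P(word-like)
```

With a realistically sized labelled set you would add richer features (character n-grams are a common choice) and hold out data for evaluation.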
Potentially a one-class classifier would be useful here. But in any case, prepare some labelled data so that you can evaluate the accuracy of this or any other approach.
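The one-class variant trains only on known words and flags everything that doesn't resemble them. A sketch using scikit-learn's `OneClassSVM` (the two toy features and the `nu` value are illustrative assumptions):

```python
from sklearn.svm import OneClassSVM

def char_profile(token):
    # Toy features: length and fraction of alphabetic characters.
    return [len(token), sum(c.isalpha() for c in token) / len(token)]

# Train on known words only; no non-word examples are needed.
known_words = ["apple", "running", "cat", "house", "quickly"]
oc = OneClassSVM(nu=0.1).fit([char_profile(w) for w in known_words])

# +1 = resembles the training distribution, -1 = outlier.
print(oc.predict([char_profile("table"), char_profile("/etc/hosts")]))
```

The appeal is that you already have abundant positive examples (your reference word list), so only the evaluation set needs manual labelling.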
Upvotes: 2