Reputation: 671
I'm working with the Stack Exchange data dump and attempting to identify unique and novel words in the corpus. I'm doing this by checking the corpus against a very large reference word list and extracting the words that are not present in it.
The problem I am running up against is that a number of the unique tokens are non-words, like directory names, error codes, and other strings.
Is there a good method of differentiating word-like strings from non-word-like strings?
I'm using NLTK, but am not limited to that toolkit.
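For context, the extraction step is essentially a set difference. A minimal sketch (the tokens and reference list below are made up for illustration; in practice the reference could be something like NLTK's `words` corpus):

```python
def novel_tokens(corpus_tokens, reference_words):
    """Return corpus tokens that are absent from the reference word list.

    reference_words: any iterable of known words, e.g. the entries of
    NLTK's words corpus (illustrative choice, not required).
    """
    reference = {w.lower() for w in reference_words}
    return sorted({t for t in corpus_tokens if t.lower() not in reference})

# Toy example: the non-words "/var/log/syslog" and "0x1F" are flagged
# right alongside the genuinely novel token "blargify".
corpus = ["word", "list", "blargify", "/var/log/syslog", "0x1F"]
print(novel_tokens(corpus, ["word", "list"]))
# → ['/var/log/syslog', '0x1F', 'blargify']
```

This illustrates the problem: the set difference cannot tell a novel word from a path or a hex code.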
Upvotes: 1
Views: 207
Reputation: 51
This is an interesting problem because it's so difficult to define what makes a combination of characters a word. I would suggest using supervised machine learning. First, take the current output from your program and manually annotate each example as a word or a non-word. Then, come up with some features that might separate the two classes.
Then, use a library like scikit-learn to train a model that captures these differences and can predict the likelihood of "wordness" for any sequence of characters.
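A minimal sketch of that pipeline, assuming a tiny hand-labelled set and a few simple surface features (the feature choices and the use of `LogisticRegression` are illustrative, not prescriptive):

```python
import string
from sklearn.linear_model import LogisticRegression

def features(token):
    """Map a token to a numeric feature vector (illustrative features)."""
    alpha = sum(c.isalpha() for c in token)
    vowels = sum(c in "aeiou" for c in token.lower())
    return [
        len(token),
        alpha / len(token),                           # fraction of letters
        vowels / max(alpha, 1),                       # vowel ratio among letters
        sum(c.isdigit() for c in token),              # digit count
        sum(c in string.punctuation for c in token),  # '/', '_', '.', ...
    ]

# Tiny hand-labelled training set: 1 = word-like, 0 = non-word-like.
train = ["running", "defenestrate", "cat", "/usr/local/bin", "0x7F", "ERR_404"]
labels = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit([features(t) for t in train], labels)
for tok in ["blargify", "/var/log/syslog"]:
    print(tok, clf.predict_proba([features(tok)])[0][1])  # P(word-like)
```

With a realistically sized labelled set you would add richer features (character n-grams are a common choice) and hold out data for evaluation.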
Potentially a one-class classifier would be useful here. But in any case, prepare some labelled data so that you can evaluate the accuracy of this or any other approach.
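The one-class variant trains only on known words and flags everything that doesn't resemble them. A sketch using scikit-learn's `OneClassSVM` (the two toy features and the `nu` value are illustrative assumptions):

```python
from sklearn.svm import OneClassSVM

def char_profile(token):
    # Toy features: length and fraction of alphabetic characters.
    return [len(token), sum(c.isalpha() for c in token) / len(token)]

# Train on known words only; no non-word examples are needed.
known_words = ["apple", "running", "cat", "house", "quickly"]
oc = OneClassSVM(nu=0.1).fit([char_profile(w) for w in known_words])

# +1 = resembles the training distribution, -1 = outlier.
print(oc.predict([char_profile("table"), char_profile("/etc/hosts")]))
```

The appeal is that you already have abundant positive examples (your reference word list), so only the evaluation set needs manual labelling.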
Upvotes: 2