Gerbal

Reputation: 671

Method for distinguishing between words and non-words

I'm working with the Stack Exchange data dump and attempting to identify unique and novel words in the corpus. I'm doing this by comparing the corpus against a very large reference word list and extracting the tokens that are not present in it.

The problem I am running up against is that a number of the unique tokens are non-words, like directory names, error codes, and other strings.

Is there a good method for differentiating word-like strings from non-word-like strings?

I'm using NLTK, but am not limited to that toolkit.

Upvotes: 1

Views: 207

Answers (1)

zelandiya

Reputation: 51

This is an interesting problem because it's so difficult to define what makes a combination of characters a word. I would suggest using supervised machine learning. First, you need to take the current output from your program and manually annotate each example as a word or a non-word. Then, come up with some features, e.g.

  • number of characters
  • first three characters
  • last three characters
  • preceding word
  • following word
  • ...
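As a rough sketch, the features listed above could be extracted like this. The function name `token_features`, the `<START>`/`<END>` placeholders, and the lowercasing are assumptions made for illustration, not part of any library.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the features listed above."""
    tok = tokens[i]
    return {
        "length": len(tok),                # number of characters
        "prefix3": tok[:3].lower(),        # first three characters
        "suffix3": tok[-3:].lower(),       # last three characters
        # Neighbouring tokens; placeholders mark the sentence boundaries.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Hypothetical tokenized sentence containing a directory name.
tokens = ["Open", "the", "/usr/lib/x86_64", "folder"]
print(token_features(tokens, 2))
```

Dict-valued features like these can be fed to a vectorizer that one-hot encodes the string values and keeps the numeric ones as they are.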

Then, use a library like scikit-learn to train a model that captures these differences and can predict the likelihood of "wordness" for any sequence of characters.
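A minimal sketch of that training step, assuming scikit-learn is installed. The tiny labeled set, the helper names `features` and `wordness`, and the choice of logistic regression are all illustrative assumptions; a real run would use your own annotated tokens and could add the context features as well.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def features(tok):
    # Character-level features only, to keep the sketch short.
    return {
        "length": len(tok),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
    }


# Hypothetical hand-annotated examples: 1 = word, 0 = non-word.
labeled = [
    ("refactoring", 1), ("serializer", 1), ("upvoted", 1), ("blogosphere", 1),
    ("/usr/lib", 0), ("0x80070005", 0), ("ERR_404", 0), ("C:\\temp", 0),
]
X = [features(tok) for tok, _ in labeled]
y = [label for _, label in labeled]

# DictVectorizer one-hot encodes string features and passes numbers through.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)


def wordness(tok):
    """Estimated probability that tok is a word (class 1)."""
    return model.predict_proba([features(tok)])[0, 1]


print(wordness("googling"), wordness("/var/log"))
```

With only eight training examples the scores are not meaningful; the point is the shape of the pipeline, which stays the same as the annotated set grows.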

A one-class classifier could also be useful here. In any case, prepare some annotated data so that you can evaluate the accuracy of this or any other approach.
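One way to try the one-class idea, sketched under the assumption that the reference word list supplies the only training class. `OneClassSVM` with character n-gram TF-IDF features is just one possible choice, and the word list and `nu` value below are placeholders for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Placeholder for entries taken from the reference word list (words only).
known_words = [
    "answer", "question", "compile", "server", "browser", "function",
    "variable", "library", "network", "database", "module", "package",
]

# Character 1- to 3-grams describe what "word-like" spelling looks like.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
)
detector.fit(known_words)

# predict returns +1 for tokens that resemble the training words, -1 otherwise.
candidates = ["debugger", "0x80070005", "/usr/lib/x86_64"]
print(dict(zip(candidates, detector.predict(candidates))))
```

Whichever approach is used, comparing its predictions against the manually annotated tokens gives the accuracy figure needed to choose between them.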

Upvotes: 2
