Reputation: 1999
I want to parse a file that is between 300 and 2,000 words and create lists of words in groups of 1 to n words long. For example, if I had this file:
The fat cat sat on a mat.
The output for 1-2 would be:
# group of words, 1 word length
['The', 'fat', 'cat', 'sat', 'on', 'a', 'mat']
# group of words, 2 word length
[['The', 'fat'], ['fat', 'cat'], ['cat', 'sat'], ['sat', 'on'], ['on', 'a'], ['a', 'mat']]
I'm sure I could write some very inefficient code to do this, but I'm wondering if there is an NLP (or other) library that can do it for me.
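For reference, a minimal pure-Python sketch of the sliding-window approach this would take (word_groups is just an illustrative name, not a library function):
def word_groups(words, n):
    # every run of n consecutive words, as a list of lists
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = 'The fat cat sat on a mat.'.rstrip('.').split()
print(word_groups(words, 1))  # [['The'], ['fat'], ['cat'], ...]
print(word_groups(words, 2))  # [['The', 'fat'], ['fat', 'cat'], ...]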
Upvotes: 1
Views: 43
Reputation: 5802
In computational linguistics, we call them unigrams, bigrams, trigrams, etc., or n-grams in general. There is an ngrams() function in NLTK.
The first thing you need to do is tokenize, e.g., like this:
from nltk.tokenize import word_tokenize
words = word_tokenize('The fat cat sat on a mat.')
# ['The', 'fat', 'cat', 'sat', 'on', 'a', 'mat', '.']
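If the tokenizer models aren't installed yet, word_tokenize raises a LookupError; a one-time download fixes that (on newer NLTK versions the resource may be named 'punkt_tab' instead):
import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models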
Then you can get the n-grams like so:
from nltk import ngrams
bigrams = list(ngrams(words, 2))
which will give you
[('The', 'fat'), ('fat', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'mat'), ('mat', '.')]
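Since you want every group size from 1 up to n, NLTK also has everygrams() (in nltk.util), which produces all of them in one pass:
from nltk.util import everygrams

# every 1-gram and 2-gram of the token list, as tuples
grams = list(everygrams(words, min_len=1, max_len=2))
# contains ('The',), ('The', 'fat'), ('fat',), ('fat', 'cat'), ...
The grams come back as tuples; map them through list() if you need your exact list-of-lists format.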
Upvotes: 4