Alex

Reputation: 1999

Parse a File To Get Set Of Words - May Be NLP Related?

I want to parse a file of between 300 and 2,000 words and create lists of word groups that are 1 to n words long. For example, if I had this file:

 The fat cat sat on a mat.

The output for 1-2 would be:

# group of words, 1 word length
['The', 'fat', 'cat', 'sat', 'on', 'a', 'mat']

# group of words, 2 word length
[['The', 'fat'], ['fat', 'cat'], ['cat', 'sat'], ['sat', 'on'], ['on', 'a'], ['a', 'mat']]

I'm sure I could write some very inefficient code that does this, but I'm wondering if there is an NLP (or other) library that can do it for me.

Upvotes: 1

Views: 43

Answers (1)

fsimonjetz

Reputation: 5802

In computational linguistics, we call these unigrams, bigrams, trigrams, etc., or n-grams in general. NLTK has an ngrams() function that does exactly this.

The first thing you need to do is tokenize, e.g., like this:

from nltk.tokenize import word_tokenize

# ['The', 'fat', 'cat', 'sat', 'on', 'a', 'mat', '.']
words = word_tokenize('The fat cat sat on a mat.')
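
Note that word_tokenize relies on NLTK's Punkt tokenizer models; if they aren't installed yet it raises a LookupError, and (assuming a default NLTK setup) a one-time download fixes that:

import nltk

nltk.download('punkt')  # one-time download of the Punkt tokenizer models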

Then you can get the ngrams like so:

from nltk import ngrams

bigrams = list(ngrams(words, 2))

which will give you

[('The', 'fat'), ('fat', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'a'), ('a', 'mat'), ('mat', '.')]
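
Since you want every group size from 1 up to n, a minimal sketch is to loop over ngrams() with the words list from above (max_n here is just a placeholder for your upper bound):

max_n = 2  # upper bound on group size; adjust as needed
all_grams = {n: list(ngrams(words, n)) for n in range(1, max_n + 1)}
# {1: [('The',), ('fat',), ...], 2: [('The', 'fat'), ('fat', 'cat'), ...]}

NLTK also ships an everygrams() helper in nltk.util that yields all n-gram lengths in a single pass, if you prefer one call.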

Upvotes: 4
