Reputation: 1183
I have the following sentence:
sentence_1 = "online auto body"
I have added the token <s> at the beginning and at the end of it
to mark where it starts and ends, so my sentence is now:
sentence = "<s> online auto body <s>"
I would like to build character trigrams from the words in sentence_1, with <s> kept as a single symbol,
as follows:
('<s>','o','n')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('a', 'u', 't')
('u', 't', 'o')
('b', 'o', 'd')
('o', 'd', 'y')
('d', 'y', '<s>')
This is the code I tried:
from nltk import ngrams
n = 3
word_3grams = ngrams(sentence.split(), n)
for w_grams in word_3grams:
    w_gram = list(w_grams)
    print(w_grams[0])
    for i in range(0,n):
        letter_3grams = ngrams(w_grams[i],3)
        for l_gram in letter_3grams:
            print(l_gram)
But what I get is:
('<', 's', '>')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('a', 'u', 't')
('u', 't', 'o')
And so on.
How can I avoid splitting <s> into character 3-grams
and treat it as a single symbol instead?
Upvotes: 1
Views: 70
Reputation: 44585
Use the third-party more_itertools.stagger tool (install via pip install more_itertools):
Code
import more_itertools as mit
sentence_1 = "online auto body"
s = sentence_1.replace(" ", "")  # drop the spaces so the trigrams run over letters only
list(mit.stagger(s, fillvalue="<s>", longest=True))[:-1]
Output
[('<s>', 'o', 'n'),
('o', 'n', 'l'),
('n', 'l', 'i'),
('l', 'i', 'n'),
('i', 'n', 'e'),
('n', 'e', 'a'),
('e', 'a', 'u'),
('a', 'u', 't'),
('u', 't', 'o'),
('t', 'o', 'b'),
('o', 'b', 'o'),
('b', 'o', 'd'),
('o', 'd', 'y'),
('d', 'y', '<s>')]
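This tool yields tuples whose items are taken at offsets (by default -1, 0 and 1) around each position of the input iterable. Positions that fall outside the iterable, at either end, are filled with the fillvalue argument. The [:-1] slice drops the final tuple, which would otherwise contain two fill values.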
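To make the offset idea concrete without the extra dependency, here is a minimal plain-Python sketch. The function name offset_windows is made up for illustration; it is not how more_itertools implements stagger, only an approximation that should give the same tuples as the sliced call above for a string with the spaces already removed.

```python
def offset_windows(items, offsets=(-1, 0, 1), fillvalue="<s>"):
    """Hypothetical helper: one tuple per input position, built from the
    items at the given offsets around that position."""
    items = list(items)

    def pick(index):
        # Positions outside the input are replaced by the fill value.
        return items[index] if 0 <= index < len(items) else fillvalue

    return [tuple(pick(i + off) for off in offsets) for i in range(len(items))]


trigrams = offset_windows("onlineautobody")
print(trigrams[0])   # ('<s>', 'o', 'n')
print(trigrams[-1])  # ('d', 'y', '<s>')
```

Because each tuple is centred on one input character, the result has exactly as many trigrams as there are letters.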
Upvotes: 0
Reputation: 2427
The desired output shows that the spaces in your input string are removed, so don't forget to replace spaces with an empty string before building the list of symbols:
sentence_1 = "online auto body"
lst = ['<s>'] + list(sentence_1.replace(' ','')) + ['<s>']
tri = [tuple(lst[n:n+3]) for n in range(len(lst)-2)]
print(tri)
This code creates a list of trigrams that you can process further:
[('<s>', 'o', 'n'), ('o', 'n', 'l'), ('n', 'l', 'i'), ('l', 'i', 'n'), ('i', 'n', 'e'), ('n', 'e', 'a'), ('e', 'a', 'u'), ('a', 'u', 't'), ('u', 't', 'o'), ('t', 'o', 'b'), ('o', 'b', 'o'), ('b', 'o', 'd'), ('o', 'd', 'y'), ('d', 'y', '<s>')]
If you only want to print the trigrams, replace the last two lines with:
print('\n'.join(str(tuple(lst[n:n+3])) for n in range(len(lst)-2)))
Output:
('<s>', 'o', 'n')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('n', 'e', 'a')
('e', 'a', 'u')
('a', 'u', 't')
('u', 't', 'o')
('t', 'o', 'b')
('o', 'b', 'o')
('b', 'o', 'd')
('o', 'd', 'y')
('d', 'y', '<s>')
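Note that the list in the question contains no trigrams spanning two words, such as ('n', 'e', 'a'). If that is intentional, a per-word variant might look like the sketch below. The function name char_trigrams_per_word and its parameters are made up for illustration, and it assumes the <s> marker belongs only before the first word and after the last one.

```python
def char_trigrams_per_word(sentence, marker="<s>", n=3):
    """Hypothetical helper: character n-grams that never cross a word boundary."""
    words = sentence.split()
    grams = []
    for idx, word in enumerate(words):
        symbols = list(word)
        if idx == 0:
            symbols.insert(0, marker)   # start marker, kept as one symbol
        if idx == len(words) - 1:
            symbols.append(marker)      # end marker, kept as one symbol
        # Slide a window of size n over this word's symbols only.
        grams.extend(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    return grams


for gram in char_trigrams_per_word("online auto body"):
    print(gram)
```

For "online auto body" this prints the ten trigrams listed in the question, from ('<s>', 'o', 'n') to ('d', 'y', '<s>'). Words shorter than n symbols contribute no trigrams in this sketch.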
Upvotes: 2