Marisa

Reputation: 1183

How to treat several characters as a single token in Python?

I have the following sentence:

sentence_1 = "online auto body" 

I have added the marker <s> at the beginning and at the end to indicate where the sentence starts and ends, so my sentence is now:

sentence = "<s> online auto body <s>" 

I would like to build character trigrams from the words in sentence_1, with <s> kept as a single item, as follows:

('<s>','o','n')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('a', 'u', 't')
('u', 't', 'o')
('b', 'o', 'd')
('o', 'd', 'y')
('d', 'y', '<s>')

This is the code I tried:

from nltk import ngrams
n = 3
word_3grams = ngrams(sentence.split(), n)


for w_grams in word_3grams:
    w_gram = list(w_grams)
    print(w_grams[0])
    for i in range(0,n):
        letter_3grams = ngrams(w_grams[i],3)
        for l_gram in letter_3grams:
            print(l_gram)

But what I get is:

('<', 's', '>')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('a', 'u', 't')
('u', 't', 'o')

And so on.

How can I prevent <s> from being split into 3-grams and have it treated as a single item instead?

Upvotes: 1

Views: 70

Answers (2)

pylang

Reputation: 44585

Use the third-party more_itertools.stagger tool (install via > pip install more_itertools):

Code

import more_itertools as mit


sentence_1 = "online auto body" 
s = sentence_1.replace(" ", "")  # drop the spaces so the output below contains letters only

list(mit.stagger(s, fillvalue="<s>", longest=True))[:-1]

Output

[('<s>', 'o', 'n'),
 ('o', 'n', 'l'),
 ('n', 'l', 'i'),
 ('l', 'i', 'n'),
 ('i', 'n', 'e'),
 ('n', 'e', 'a'),
 ('e', 'a', 'u'),
 ('a', 'u', 't'),
 ('u', 't', 'o'),
 ('t', 'o', 'b'),
 ('o', 'b', 'o'),
 ('b', 'o', 'd'),
 ('o', 'd', 'y'),
 ('d', 'y', '<s>')]

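This tool yields tuples whose items are taken at different offsets from the input iterable (by default -1, 0 and 1). With longest=True, positions that fall outside the input are filled with fillvalue. The final tuple, ('y', '<s>', '<s>'), is dropped by the [:-1] slice.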
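
If you would rather not add a dependency, the behaviour described above can be approximated with the standard library. This is only a rough sketch, not the library's actual implementation: stagger_sketch is a hypothetical helper name, and it covers just the default offsets of -1, 0 and 1.

```python
from itertools import chain, islice, zip_longest


def stagger_sketch(iterable, fillvalue=None):
    """Approximate a three-wide staggered window (hypothetical helper)."""
    items = list(iterable)
    return list(
        zip_longest(
            chain([fillvalue], items),  # offset -1: padded in front
            items,                      # offset 0: the items themselves
            islice(items, 1, None),     # offset +1: starts one item later
            fillvalue=fillvalue,        # pad the streams that run out first
        )
    )


chars = "online auto body".replace(" ", "")
# Drop the final tuple, which holds only the last letter plus two fill values.
trigrams = stagger_sketch(chars, fillvalue="<s>")[:-1]
print(trigrams[0], trigrams[-1])
```

The three zipped streams play the role of the three offsets, and zip_longest supplies the padding at the end.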

Upvotes: 0

sciroccorics

Reputation: 2427

The desired output shows that spaces are removed from your input string, so remember to replace spaces with an empty string before building the character list:

sentence_1 = "online auto body"

lst = ['<s>'] + list(sentence_1.replace(' ','')) + ['<s>']
tri = [tuple(lst[n:n+3]) for n in range(len(lst)-2)]
print(tri)

This code creates a list of trigrams that you can process further:

[('<s>', 'o', 'n'), ('o', 'n', 'l'), ('n', 'l', 'i'), ('l', 'i', 'n'), ('i', 'n', 'e'), ('n', 'e', 'a'), ('e', 'a', 'u'), ('a', 'u', 't'), ('u', 't', 'o'), ('t', 'o', 'b'), ('o', 'b', 'o'), ('b', 'o', 'd'), ('o', 'd', 'y'), ('d', 'y', '<s>')]

If you only want to print the trigrams, replace the last two lines with:

print('\n'.join(str(tuple(lst[n:n+3])) for n in range(len(lst)-2)))

Output:

('<s>', 'o', 'n')
('o', 'n', 'l')
('n', 'l', 'i')
('l', 'i', 'n')
('i', 'n', 'e')
('n', 'e', 'a')
('e', 'a', 'u')
('a', 'u', 't')
('u', 't', 'o')
('t', 'o', 'b')
('o', 'b', 'o')
('b', 'o', 'd')
('o', 'd', 'y')
('d', 'y', '<s>')
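
Note that the output above contains trigrams that cross word boundaries, such as ('n', 'e', 'a'), which do not appear in the output listed in the question. If the trigrams should stay inside each word, with <s> attached only to the first and last word, one possible approach is sketched below. The function name word_trigrams and its marker parameter are made up for illustration.

```python
def word_trigrams(sentence, marker="<s>"):
    """Character trigrams per word, with `marker` only at the sentence edges.

    Hypothetical helper: the name and the `marker` parameter are illustrative.
    """
    words = sentence.split()
    result = []
    for i, word in enumerate(words):
        tokens = list(word)
        if i == 0:
            tokens.insert(0, marker)   # marker before the first word
        if i == len(words) - 1:
            tokens.append(marker)      # marker after the last word
        # Slide a window of width 3 over this word's tokens only.
        result.extend(tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2))
    return result


for gram in word_trigrams("online auto body"):
    print(gram)
```

Because the marker is inserted as one list element rather than as part of a string, it is never split into '<', 's' and '>'.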

Upvotes: 2
