Reputation: 39
Assume that I have data that looks like this:
['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']
I would like to get the number of bigrams that occur only once, so
n1 == [('I', '<s>'), ('I', 'UNK'), ('UNK', '</s>')]
len(n1) == 3
and the number of bigrams that occur twice:
n2 == [('<s>', 'I')]
len(n2) == 1
I am thinking of storing the first word as sen[i] and the next word as sen[i + 1], but I am not sure whether this is the right approach.
Upvotes: 1
Views: 6588
Reputation: 349
Given your list:
lis = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']
loop over it, build each bigram as a tuple, and record its frequency in a dictionary:
bigram_freq = {}
length = len(lis)
for i in range(length - 1):
    bigram = (lis[i], lis[i + 1])
    if bigram not in bigram_freq:
        bigram_freq[bigram] = 0
    bigram_freq[bigram] += 1
Now count the bigrams with frequency 1 and frequency 2:
bigrams_with_frequency_one = 0
bigrams_with_frequency_two = 0
for bigram in bigram_freq:
    if bigram_freq[bigram] == 1:
        bigrams_with_frequency_one += 1
    elif bigram_freq[bigram] == 2:
        bigrams_with_frequency_two += 1
After the loop, bigrams_with_frequency_one and bigrams_with_frequency_two hold your results. I hope it helps!
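For comparison, the same counting can probably be written more compactly with collections.Counter from the standard library. This is only a sketch of an alternative, not a change to the approach above; the names tokens and freq_of_freq are made up for illustration.

```python
from collections import Counter

tokens = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# Pair each token with its successor. zip stops at the shorter input,
# so the last token never starts a bigram.
bigram_freq = Counter(zip(tokens, tokens[1:]))

# Counting the frequencies themselves gives "how many distinct bigrams occur k times".
freq_of_freq = Counter(bigram_freq.values())

print(freq_of_freq[1])  # 3
print(freq_of_freq[2])  # 1
```

A Counter returns 0 for a missing key, so freq_of_freq[3] is simply 0 here rather than raising a KeyError.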
Upvotes: 1
Reputation: 744
You can try this:
my_list = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']
bigrams = [(my_list[i - 1], my_list[i]) for i in range(1, len(my_list))]
print(bigrams)
# [('<s>', 'I'), ('I', '<s>'), ('<s>', 'I'), ('I', 'UNK'), ('UNK', '</s>')]
d = {}
for c in set(bigrams):
    count = bigrams.count(c)
    d.setdefault(count, []).append(c)
print(d)
# {1: [('I', '<s>'), ('UNK', '</s>'), ('I', 'UNK')], 2: [('<s>', 'I')]}
# (the order inside each list may vary, because set iteration order is not fixed)
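Note that bigrams.count(c) rescans the whole list for every distinct bigram, which can get slow on a large corpus. A possible single-pass variant, assuming you still want the bigrams grouped by how often they occur (by_count is a hypothetical name, and n1 and n2 are borrowed from the question):

```python
from collections import Counter, defaultdict

my_list = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# Count every adjacent pair in one pass over the list.
counts = Counter(zip(my_list, my_list[1:]))

# Group the distinct bigrams by their frequency.
by_count = defaultdict(list)
for bigram, count in counts.items():
    by_count[count].append(bigram)

n1 = by_count[1]  # bigrams that occur exactly once
n2 = by_count[2]  # bigrams that occur exactly twice
print(len(n1), len(n2))  # 3 1
```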
Upvotes: 0