Forming Bigrams of words in list of sentences and counting bigrams using python

Question

I need to: 1. form bigram pairs and store them in a list; 2. find the ids of the sentences that contain the top 3 bigrams with the highest frequency.

I have a list of sentences:

[['22574999', 'your message communication sent']
, ['22582857', 'your message be delivered']
, ['22585166', 'message has be delivered']
, ['22585424', 'message originated communication sent']]

Here is what I did:

for row in messages: 
    sstrm = list(row)
    bigrams=[b for l in sstrm for b in zip(l.split(" ")[:1], l.split(" ")[1:])]
    print(sstrm[0],bigrams)

which yields:

22574999 [('your', 'message')]
22582857 [('your', 'message')]
22585166 [('message', 'has')]
22585424 [('message', 'originated')]

What I want is:

22574999 [('your', 'message'),('communication','sent')]
22582857 [('your', 'message'),('be','delivered')]
22585166 [('message', 'has'),('be','delivered')]
22585424 [('message', 'originated'),('communication','sent')]

I would like to get the following result:

top 3 bigrams with highest frequency:

('your', 'message'): 2
('communication', 'sent'): 2
('be', 'delivered'): 2

ids of the sentences that contain the top 3 bigrams with the highest frequency:

('your', 'message'): 2           is included in (22574999, 22582857)
('communication', 'sent'): 2     is included in (22574999, 22585424)
('be', 'delivered'): 2           is included in (22582857, 22585166)

Thanks for your help!

Daniele Bacarella · Accepted Answer

First thing I'd like to point out is that bigrams are sequences of two adjacent elements.

For instance, the bigrams of "the fox jumped over the lazy dog" are:

[("the", "fox"),("fox", "jumped"),("jumped", "over"),("over", "the"),("the", "lazy"),("lazy", "dog")]
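As a minimal sketch (the helper name adjacent_pairs is only illustrative, not part of any library), pairing each token with its successor reproduces the list above. It also suggests why the slice [:1] in the question returns a single pair per sentence: it cuts the left-hand sequence down to its first token, whereas zipping the full token list with tokens[1:] covers every adjacent pair.

```python
def adjacent_pairs(sentence):
    """Return the bigrams (adjacent token pairs) of a whitespace-separated sentence."""
    tokens = sentence.split()
    # zip stops at the shorter input, so the last token simply has no successor
    return list(zip(tokens, tokens[1:]))


print(adjacent_pairs("the fox jumped over the lazy dog"))

# Slicing the left side with [:1], as in the question, keeps only the first pair
tokens = "your message communication sent".split()
print(list(zip(tokens[:1], tokens[1:])))
```

Note that this produces overlapping pairs such as ('message', 'communication'), not only the non-overlapping pairs listed in the question.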

This problem can be modeled using an inverted index, where the bigrams are the terms (the keys) and the lists of ids that contain each bigram are the posting lists.

def bigrams(line):
    tokens = line.split(" ")
    return [(tokens[i], tokens[i+1]) for i in range(0, len(tokens)-1)]


if __name__ == "__main__":
    messages = [['22574999', 'your message communication sent'], ['22582857', 'your message be delivered'], ['22585166', 'message has be delivered'], ['22585424', 'message originated communication sent']]
    bigrams_set = set()

    for row in messages:
        l_bigrams = bigrams(row[1])
        for bigram in l_bigrams:
            bigrams_set.add(bigram)

    inverted_idx = dict((b,[]) for b in bigrams_set)

    for row in messages:
        l_bigrams = bigrams(row[1])
        for bigram in l_bigrams:
            inverted_idx[bigram].append(row[0])

    freq_bigrams = dict((b,len(ids)) for b,ids in inverted_idx.items())
    import operator
    # items() works on both Python 2 and 3; iteritems() exists only on Python 2
    top3_bigrams = sorted(freq_bigrams.items(), key=operator.itemgetter(1), reverse=True)[:3]
    print(top3_bigrams)

Output

[(('communication', 'sent'), 2), (('your', 'message'), 2), (('be', 'delivered'), 2)]

Although this code can be optimized a great deal, it conveys the idea.
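One possible tightening, sketched under the assumption that each id should count at most once per bigram: build the index in a single pass with collections.defaultdict, then rank the bigrams by how many ids they map to. The names build_index and top_bigrams are hypothetical. Returning the ids alongside each bigram also covers the "is included" part of the question.

```python
from collections import defaultdict


def bigrams(line):
    """Return the adjacent token pairs of a whitespace-separated line."""
    tokens = line.split()
    return list(zip(tokens, tokens[1:]))


def build_index(messages):
    """Map each bigram to the set of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, text in messages:
        for bigram in bigrams(text):
            index[bigram].add(msg_id)
    return index


def top_bigrams(index, n=3):
    """Return the n bigrams found in the most messages, each with its sorted ids."""
    ranked = sorted(index.items(), key=lambda item: len(item[1]), reverse=True)
    return [(bigram, sorted(ids)) for bigram, ids in ranked[:n]]


messages = [
    ['22574999', 'your message communication sent'],
    ['22582857', 'your message be delivered'],
    ['22585166', 'message has be delivered'],
    ['22585424', 'message originated communication sent'],
]

for bigram, ids in top_bigrams(build_index(messages)):
    print(bigram, len(ids), ids)
```

Using a set per bigram means a bigram repeated within one sentence is counted once for that id; switch to a list if every occurrence should count.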
