Abhishek Bhatia

Reputation: 9806

Probability distribution of two lists of words

I have two lists of strings:

(Pdb) word_list1
['first', 'sentence', 'ant', 'first', 'whatever']
(Pdb) word_list2
['second', 'second', 'heck', 'anything', 'youtube', 'gmail', 'hotmail']

For each of the two lists, I want to compute a probability distribution over the union of the words in both lists, i.e. one probability per word in the union.

(Pdb) print list(set(word_list1) | set(word_list2))
['hotmail', 'anything', 'sentence', 'maybe', 'youtube', 'whatever', 'ant', 'second', 'heck', 'gmail', 'first']
(Pdb) len(list(set(word_list1) | set(word_list2)))
11

So I want two vectors of length 11, one for each word list.

Upvotes: 1

Views: 351

Answers (1)

Colonel Beauvel

Reputation: 31171

What you want as a result is a dictionary with 11 entries rather than a bare vector. If you are after frequencies, use Counter instead of set operations:

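from collections import Counter

# l1, l2 are word_list1 and word_list2 from the question
n   = len(l1) + len(l2)                # total number of words across both lists
dic = dict(Counter(l1) + Counter(l2))  # combined counts, one key per distinct word
c1  = Counter(l1)                      # counts for the first list only

# for the first list (float() avoids integer division on Python 2)
{k: round(c1[k] / float(n), 2) if k in c1 else 0 for k in dic}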

#{'ant': 0.08,
# 'anything': 0,
# 'first': 0.17,
# 'gmail': 0,
# 'heck': 0,
# 'hotmail': 0,
# 'second': 0,
# 'sentence': 0.08,
# 'whatever': 0.08,
# 'youtube': 0}
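
If what you need is literally two vectors, each a proper distribution that sums to 1, one possible sketch is below. It makes two assumptions that are not in the question: each list is normalised by its own length rather than by the combined length, and the union is sorted so that both vectors share one fixed word order. The helper name distribution_over_vocab is made up for illustration.

```python
from collections import Counter

def distribution_over_vocab(words, vocab):
    """Relative frequency of each vocab word within `words` (hypothetical helper)."""
    counts = Counter(words)
    total = len(words)
    if total == 0:
        # An empty list has no distribution; return all zeros instead of dividing by zero.
        return [0.0 for _ in vocab]
    # Counter returns 0 for words that never occur; float() keeps Python 2 from truncating.
    return [counts[w] / float(total) for w in vocab]

word_list1 = ['first', 'sentence', 'ant', 'first', 'whatever']
word_list2 = ['second', 'second', 'heck', 'anything', 'youtube', 'gmail', 'hotmail']

# Sort the union so position i refers to the same word in both vectors.
vocab = sorted(set(word_list1) | set(word_list2))

p1 = distribution_over_vocab(word_list1, vocab)
p2 = distribution_over_vocab(word_list2, vocab)
```

Pairing the results back up with dict(zip(vocab, p1)) gives the dictionary form used above. If you would rather divide by the combined length as in the snippet above, pass that total in instead of len(words); the vectors then no longer sum to 1 individually.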

Upvotes: 1
