Reputation: 953
I am trying to create bigrams between all the words in a list. I can create bigrams of each word with the word that follows it, but I can't combine each word with all the other words in the list.
Script:
countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']
countries_bigrams = zip(*[countries[i:] for i in range(2)])
print(list(countries_bigrams))
My output:
[('france', 'brazil'), ('brazil', 'usa'), ('usa', 'uk'), ('uk', 'canada'), ('canada', 'mexico'), ('mexico', 'angola')]
Desired output:
[('france', 'brazil'), ('france', 'usa'), ('france', 'uk'), ('france', 'canada'), ('france', 'mexico'), ('france', 'angola'), ('brazil', 'france'), ('brazil', 'usa'), ('brazil', 'uk'), ('brazil', 'canada'), ('brazil', 'mexico'), ('brazil', 'angola'), ...]
Upvotes: 0
Views: 253
Reputation: 7303
You can simply use a list comprehension:
[(c1, c2) for c1 in countries for c2 in countries if c1 != c2]
Output:
[('france', 'brazil'), ('france', 'usa'), ('france', 'uk'), ('france', 'canada'), ('france', 'mexico'), ('france', 'angola'), ('brazil', 'france'), ('brazil', 'usa'), ('brazil', 'uk'), ('brazil', 'canada'), ('brazil', 'mexico'), ('brazil', 'angola'), ('usa', 'france'), ('usa', 'brazil'), ('usa', 'uk'), ('usa', 'canada'), ('usa', 'mexico'), ('usa', 'angola'), ('uk', 'france'), ('uk', 'brazil'), ('uk', 'usa'), ('uk', 'canada'), ('uk', 'mexico'), ('uk', 'angola'), ('canada', 'france'), ('canada', 'brazil'), ('canada', 'usa'), ('canada', 'uk'), ('canada', 'mexico'), ('canada', 'angola'), ('mexico', 'france'), ('mexico', 'brazil'), ('mexico', 'usa'), ('mexico', 'uk'), ('mexico', 'canada'), ('mexico', 'angola'), ('angola', 'france'), ('angola', 'brazil'), ('angola', 'usa'), ('angola', 'uk'), ('angola', 'canada'), ('angola', 'mexico')]
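One caveat, shown as a minimal sketch: the `c1 != c2` test compares values, so if the list ever contained the same word twice, pairs made of the two copies would be filtered out as well. Comparing positions instead avoids that. The `words` list below is made up for illustration and is not from the question.

```python
# Hypothetical list with a repeated word, for illustration only.
words = ['usa', 'uk', 'usa']

# Value-based filter: ('usa', 'usa') is dropped even though the two
# elements sit at different positions in the list.
by_value = [(w1, w2) for w1 in words for w2 in words if w1 != w2]

# Position-based filter: only skips pairing an element with itself.
by_index = [(w1, w2)
            for i, w1 in enumerate(words)
            for j, w2 in enumerate(words)
            if i != j]

print(by_value)
print(by_index)
```

For the list in the question, which has no repeated words, both variants return the same pairs.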
Upvotes: 1
Reputation: 999
You need to compute the combinations of size 2. A good way to do that is with itertools:
import itertools
countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']
list(itertools.combinations(countries, 2))
which gives as output:
[('france', 'brazil'),
('france', 'usa'),
('france', 'uk'),
('france', 'canada'),
('france', 'mexico'),
('france', 'angola'),
('brazil', 'usa'),
('brazil', 'uk'),
('brazil', 'canada'),
('brazil', 'mexico'),
('brazil', 'angola'),
('usa', 'uk'),
('usa', 'canada'),
('usa', 'mexico'),
('usa', 'angola'),
('uk', 'canada'),
('uk', 'mexico'),
('uk', 'angola'),
('canada', 'mexico'),
('canada', 'angola'),
('mexico', 'angola')]
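Note that `combinations` treats pairs as unordered, so it returns 21 pairs and never produces `('brazil', 'france')`, which the desired output in the question does include. If both orders are wanted, `itertools.permutations` is probably the closer fit. A short sketch comparing the two:

```python
import itertools

countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

# combinations: unordered pairs, each pair appears exactly once.
unordered = list(itertools.combinations(countries, 2))

# permutations: ordered pairs, so ('brazil', 'france') is produced too.
ordered = list(itertools.permutations(countries, 2))

print(len(unordered), len(ordered))  # 21 42
print(ordered[:7])
```

With 7 distinct words, `permutations` yields 7 * 6 = 42 pairs, in the same order as the nested list comprehension.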
I think this is one of the faster methods. I compared the performance on my machine:
%%timeit
list(itertools.combinations(countries, 2))
takes around 1.28 µs
%%timeit
[(c1, c2) for c1 in countries for c2 in countries if c1 != c2]
takes around 4.81 µs
%%timeit
countries_bigrams = [(x, y) for x, y in itertools.product(countries, repeat=2) if x != y]
takes around 5.81 µs
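To reproduce this comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The `candidates` and `timings` names and the repeat count are my own choices, and the absolute numbers will differ from machine to machine. Keep in mind that the comparison is not quite like-for-like: `combinations` builds 21 pairs, while the other two build 42.

```python
import itertools
import timeit

countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

# The three approaches from this thread, wrapped so timeit can call them.
candidates = {
    'combinations': lambda: list(itertools.combinations(countries, 2)),
    'nested comprehension': lambda: [(c1, c2) for c1 in countries
                                     for c2 in countries if c1 != c2],
    'product + filter': lambda: [(x, y) for x, y
                                 in itertools.product(countries, repeat=2)
                                 if x != y],
}

runs = 10_000  # arbitrary repeat count; raise it for steadier numbers
timings = {}
for name, func in candidates.items():
    total_seconds = timeit.timeit(func, number=runs)
    timings[name] = total_seconds / runs * 1e6  # microseconds per call
    print(f'{name}: {timings[name]:.2f} µs per call')
```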
Upvotes: 2
Reputation: 22776
You can use itertools.product to get the Cartesian product of the list with itself and remove tuples of duplicate values using a list comprehension:
from itertools import product
countries_bigrams = [(x, y) for x, y in product(countries, repeat=2) if x != y]
print(countries_bigrams)
Output:
[('france', 'brazil'), ('france', 'usa'), ('france', 'uk'), ('france', 'canada'), ('france', 'mexico'), ('france', 'angola'), ('brazil', 'france'), ('brazil', 'usa'), ('brazil', 'uk'), ('brazil', 'canada'), ('brazil', 'mexico'), ('brazil', 'angola'), ('usa', 'france'), ('usa', 'brazil'), ('usa', 'uk'), ('usa', 'canada'), ('usa', 'mexico'), ('usa', 'angola'), ('uk', 'france'), ('uk', 'brazil'), ('uk', 'usa'), ('uk', 'canada'), ('uk', 'mexico'), ('uk', 'angola'), ('canada', 'france'), ('canada', 'brazil'), ('canada', 'usa'), ('canada', 'uk'), ('canada', 'mexico'), ('canada', 'angola'), ('mexico', 'france'), ('mexico', 'brazil'), ('mexico', 'usa'), ('mexico', 'uk'), ('mexico', 'canada'), ('mexico', 'angola'), ('angola', 'france'), ('angola', 'brazil'), ('angola', 'usa'), ('angola', 'uk'), ('angola', 'canada'), ('angola', 'mexico')]
Note that these pairs are not bigrams; your original approach already gives you the bigrams.
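To make the distinction concrete, here is a small sketch: a bigram pairs each word with the word that immediately follows it, which is what the `zip` expression in the question computes. The shorter `zip(countries, countries[1:])` form below is my own rewrite of it.

```python
countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

# Bigrams: each word paired with its immediate successor.
bigrams = list(zip(countries, countries[1:]))

# The expression from the question produces the same adjacent pairs.
original = list(zip(*[countries[i:] for i in range(2)]))

print(bigrams)
print(bigrams == original)  # True
```

A list of 7 words has 6 bigrams, compared with 42 ordered pairs of distinct words.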
Upvotes: 2