marin

Reputation: 953

Create list of bigrams with all the words in a list

I am trying to pair every word in a list with every other word in that list. I can create bigrams of each word with the word that follows it, but I can't combine one word with all the others in the list.

Script:

countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

countries_bigrams = zip(*[countries[i:] for i in range(2)])

print(list(countries_bigrams))

My output:

[('france', 'brazil'), ('brazil', 'usa'), ('usa', 'uk'), ('uk', 'canada'), ('canada', 'mexico'), ('mexico', 'angola')]

Desired output:

[('france', 'brazil'), ('france', 'usa'), ('france', 'uk'), ('france', 'canada'), ('france', 'mexico'), ('france', 'angola'), ('brazil', 'france'), ('brazil', 'usa'), ('brazil', 'uk'), ('brazil', 'canada'), ('brazil', 'mexico'), ('brazil', 'angola'), ...]

Upvotes: 0

Views: 253

Answers (3)

Nouman

Reputation: 7303

You can simply use a list comprehension:

[(c1, c2) for c1 in countries for c2 in countries if c1 != c2]

Output:

[('france', 'brazil'),
 ('france', 'usa'),
 ('france', 'uk'),
 ('france', 'canada'),
 ('france', 'mexico'),
 ('france', 'angola'),
 ('brazil', 'france'),
 ('brazil', 'usa'),
 ('brazil', 'uk'),
 ('brazil', 'canada'),
 ('brazil', 'mexico'),
 ('brazil', 'angola'),
 ('usa', 'france'),
 ('usa', 'brazil'),
 ('usa', 'uk'),
 ('usa', 'canada'),
 ('usa', 'mexico'),
 ('usa', 'angola'),
 ('uk', 'france'),
 ('uk', 'brazil'),
 ('uk', 'usa'),
 ('uk', 'canada'),
 ('uk', 'mexico'),
 ('uk', 'angola'),
 ('canada', 'france'),
 ('canada', 'brazil'),
 ('canada', 'usa'),
 ('canada', 'uk'),
 ('canada', 'mexico'),
 ('canada', 'angola'),
 ('mexico', 'france'),
 ('mexico', 'brazil'),
 ('mexico', 'usa'),
 ('mexico', 'uk'),
 ('mexico', 'canada'),
 ('mexico', 'angola'),
 ('angola', 'france'),
 ('angola', 'brazil'),
 ('angola', 'usa'),
 ('angola', 'uk'),
 ('angola', 'canada'),
 ('angola', 'mexico')]
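One caveat, in case the list can contain repeated words: `c1 != c2` compares values, so a repeated word is never paired with its other occurrence. A minimal sketch that compares positions instead; the list `words` is a hypothetical input, not taken from the question:

```python
words = ['usa', 'uk', 'usa']  # hypothetical input with a repeated word

# Compare positions rather than values, so only self-pairs are skipped.
pairs = [
    (w1, w2)
    for i, w1 in enumerate(words)
    for j, w2 in enumerate(words)
    if i != j
]

print(pairs)
```

For a list without repeats, such as `countries`, this produces the same pairs as the comprehension above.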

Upvotes: 1

FabioL

Reputation: 999

You need to compute the combinations of size 2. A good way to do that is with itertools:

import itertools
countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']
list(itertools.combinations(countries, 2))

which gives as output:

[('france', 'brazil'),
 ('france', 'usa'),
 ('france', 'uk'),
 ('france', 'canada'),
 ('france', 'mexico'),
 ('france', 'angola'),
 ('brazil', 'usa'),
 ('brazil', 'uk'),
 ('brazil', 'canada'),
 ('brazil', 'mexico'),
 ('brazil', 'angola'),
 ('usa', 'uk'),
 ('usa', 'canada'),
 ('usa', 'mexico'),
 ('usa', 'angola'),
 ('uk', 'canada'),
 ('uk', 'mexico'),
 ('uk', 'angola'),
 ('canada', 'mexico'),
 ('canada', 'angola'),
 ('mexico', 'angola')]

I think this is one of the faster methods. I compared the performance on my machine:

%%timeit
list(itertools.combinations(countries, 2))

takes around 1.28 µs

%%timeit
[(c1, c2) for c1 in countries for c2 in countries if c1 != c2]

takes around 4.81 µs

%%timeit
countries_bigrams = [(x, y) for x, y in itertools.product(countries, repeat=2) if x != y]

takes around 5.81 µs
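Note that `itertools.combinations` returns each unordered pair once, whereas the desired output in the question lists both orders, e.g. ('france', 'brazil') and ('brazil', 'france'). If both orders are needed, `itertools.permutations` may be the closer match. A minimal sketch comparing the two:

```python
from itertools import combinations, permutations

countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

# Each unordered pair once: ('france', 'brazil') but not ('brazil', 'france').
unordered_pairs = list(combinations(countries, 2))

# Every ordered pair of distinct positions, so both directions appear.
ordered_pairs = list(permutations(countries, 2))

print(len(unordered_pairs))  # 21
print(len(ordered_pairs))    # 42
```

This also means the first timing above measures a call that builds half as many tuples as the other two, so the numbers are not directly comparable.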

Upvotes: 2

Djaouad

Reputation: 22776

You can use itertools.product to get the Cartesian product of the list with itself, then use a list comprehension to filter out the tuples that pair a value with itself:

from itertools import product

countries_bigrams = [(x, y) for x, y in product(countries, repeat=2) if x != y]

print(countries_bigrams)

Output:

[('france', 'brazil'), ('france', 'usa'), ('france', 'uk'), ('france', 'canada'), ('france', 'mexico'), ('france', 'angola'), ('brazil', 'france'), ('brazil', 'usa'), ('brazil', 'uk'), ('brazil', 'canada'), ('brazil', 'mexico'), ('brazil', 'angola'), ('usa', 'france'), ('usa', 'brazil'), ('usa', 'uk'), ('usa', 'canada'), ('usa', 'mexico'), ('usa', 'angola'), ('uk', 'france'), ('uk', 'brazil'), ('uk', 'usa'), ('uk', 'canada'), ('uk', 'mexico'), ('uk', 'angola'), ('canada', 'france'), ('canada', 'brazil'), ('canada', 'usa'), ('canada', 'uk'), ('canada', 'mexico'), ('canada', 'angola'), ('mexico', 'france'), ('mexico', 'brazil'), ('mexico', 'usa'), ('mexico', 'uk'), ('mexico', 'canada'), ('mexico', 'angola'), ('angola', 'france'), ('angola', 'brazil'), ('angola', 'usa'), ('angola', 'uk'), ('angola', 'canada'), ('angola', 'mexico')]

Note that these pairs are not bigrams; your original approach already produces the bigrams.
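To make the distinction concrete: a bigram pairs each word with the word that directly follows it, which is what the zip expression in the question computes. A short sketch of the same result written with two explicit arguments:

```python
countries = ['france', 'brazil', 'usa', 'uk', 'canada', 'mexico', 'angola']

# Pair each word with its immediate successor (adjacent pairs only).
bigrams = list(zip(countries, countries[1:]))

print(bigrams)
```

For a list of n words this gives n - 1 pairs, compared with n * (n - 1) ordered pairs from the product-based approach.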

Upvotes: 2
