Reputation: 11
I would like to generate unordered bigrams. For example, "the cat sat on the mat" should give:
[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]
Each bigram is sorted alphabetically. This means, for example, that "to house to" gives [("house", "to"), ("house", "to")],
which yields a higher frequency for these bigrams while reducing the search space.
I am able to get the above using:
unordered_bigrams = [tuple(sorted(b)) for b in nltk.bigrams(words)]
But I would now like to have a "bag-of-words" type vector for these.
I have ordered bigram feature vectors using:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
I would like the same for my unordered bigrams, but I'm struggling to find an option in CountVectorizer that provides this (I've looked at vocabulary and preprocessor without much luck).
Upvotes: 1
Views: 688
Reputation: 122148
You don't really need a bigram generator if all you need are pairs of possible words given an unordered list of words:
>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]
Or, if you don't want duplicate tuples that contain the same words in a different order:
>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]
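Because `words` is a set, the order of the tuples above depends on the set's iteration order and may differ between runs. A minimal sketch, assuming you want every pair in alphabetical order as in the question: sort the words first, since `combinations` emits elements in the order of its input.

```python
from itertools import combinations

# Sort the unique words so every emitted pair is alphabetically ordered.
words = sorted(set("the cat sat on the mat".split()))

# combinations() keeps input order, so each tuple satisfies a < b.
pairs = list(combinations(words, 2))
print(pairs)
# [('cat', 'mat'), ('cat', 'on'), ('cat', 'sat'), ('cat', 'the'),
#  ('mat', 'on'), ('mat', 'sat'), ('mat', 'the'), ('on', 'sat'),
#  ('on', 'the'), ('sat', 'the')]
```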
There's a good answer on product, combination and permutation at https://stackoverflow.com/a/942551/610569
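Note that the pairs above cover all word pairs, not just adjacent ones. If adjacency matters, as in the question, here is a rough sketch of a bag-of-words style count over unordered adjacent bigrams using only the standard library. The helper name `unordered_bigrams` and the whitespace tokenisation are my own assumptions, not anything from scikit-learn or NLTK.

```python
from collections import Counter

def unordered_bigrams(text):
    """Hypothetical helper: adjacent word pairs, each sorted alphabetically."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenisation
    return [" ".join(sorted(pair)) for pair in zip(tokens, tokens[1:])]

docs = ["the cat sat on the mat", "to house to"]

# Count the unordered bigrams per document.
counts = [Counter(unordered_bigrams(d)) for d in docs]

# Shared vocabulary across documents, in a fixed (sorted) order.
vocabulary = sorted(set().union(*counts))

# One count vector per document, aligned with the vocabulary.
vectors = [[c[term] for term in vocabulary] for c in counts]
```

`CountVectorizer` accepts a callable for its `analyzer` argument, so passing a function like this as `CountVectorizer(analyzer=unordered_bigrams)` should give the same kind of counts as a sparse matrix. I have not tested that here, and `ngram_range` is not applied when the analyzer is a callable.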
Upvotes: 1