user10732053
user10732053

Reputation:

find bigram using gensim

this is my code

from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])

i want it detects "the_hill_have_eyes" but the output is

['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']

Upvotes: 0

Views: 828

Answers (2)

Harshal Parekh
Harshal Parekh

Reputation: 6017

What you want is not actually bigrams but "fourgrams".

This can be achieved by doing something like this (my old piece of code I wrote some months ago):

// read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)

bigram = Phraser(phrases)
sent = [u'trees', u'graph', u'minors']
// look for words in "sent"
print(bigram[sent])
[u'trees_graph', u'minors'] // output

// to create the bigrams
bigram_model = Phrases(unigram_sentences)

// apply the trained model to a sentence
 for unigram_sentence in unigram_sentences:                
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])

// get a trigram model out of the bigram
trigram_model = Phrases(bigram_sentences)

So here you have a trigram model (detecting 3 words together) and you get the idea on how to implement fourgrams.

Hope this helps. Good luck.

Upvotes: 1

gojomo
gojomo

Reputation: 54153

Phrases is a purely-statistical method for combining some unigram-token-pairs to new bigram-tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.

Note especially that:

  • even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see

  • since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data

In particular with regard to that last point & your example, the interpretation of min_count in Phrases default-scoring means even a min_count=1 isn't low enough to cause bigrams for which there is only a single example in the training-corpus to be created.

So, if you expand your training-corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training are larger, realistic corpuses, so anything you see in tiny contrived examples may not generalize to real uses.

Upvotes: 1

Related Questions