How to eliminate repeated bigrams from trigrams in Python NLTK

Question

I have generated bigrams and trigrams in separate files.

The bigrams look like this:

high cpu
power supply
nexus 7000
..

The trigrams look like this:

high cpu due
power supply failure
..

For some phrases only a bigram is generated, and a trigram would not add much meaning. For other phrases, such as "high cpu due", the trigram carries much more meaning than the bigram.

So I want to eliminate the bigrams that are already present in a trigram and retain only the bigrams that are not present in any trigram. I tried the code below. It finds the bigrams that are present in trigrams, but when a bigram is not found in any trigram, it does not give that bigram back.

terms = ['ios zone', 'ios zone firewall']
phrases = [
    z for z in terms if z not in [x for x in terms for y in terms if x in y and x != y]
]
print(phrases)

This returns ['ios', 'zone', 'firewall'], but if there is no match it should return the bigrams.

pault · Accepted Answer

IIUC, you want to keep only bigrams that are not contained in any of the trigrams. One approach is to check for substring matches:

bigrams = [
    "high cpu",
    "power supply",
    "nexus 7000"
]

trigrams = [
    "high cpu due",
    "power supply failure"
]

new_bigrams = [b for b in bigrams if all(b not in t for t in trigrams)]
print(new_bigrams)
# ['nexus 7000']

We build new_bigrams with a list comprehension that keeps a bigram only if it is not contained in any of the trigrams. The expression all(b not in t for t in trigrams) evaluates to False as soon as the bigram is a substring of at least one trigram, so that bigram is dropped.
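One caveat: the substring test works on characters, so a bigram such as "up time" would also match inside "backup timer expired". If that matters for your data, a possible variant is to compare whole words instead. The sketch below is only an illustration under that assumption. The helper names bigrams_inside and drop_covered_bigrams and the extra sample phrases are made up for this example and are not part of NLTK or of the original question.

```python
def bigrams_inside(phrase):
    """Return the set of contiguous word pairs in a phrase (hypothetical helper)."""
    words = phrase.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


def drop_covered_bigrams(bigrams, trigrams):
    """Keep only the bigrams that are not a whole-word pair of any trigram."""
    covered = set()
    for t in trigrams:
        covered |= bigrams_inside(t)
    # Normalise whitespace in each bigram before the set lookup.
    return [b for b in bigrams if " ".join(b.split()) not in covered]


# Sample data: the last bigram and the last trigram are invented for illustration.
bigrams = ["high cpu", "power supply", "nexus 7000", "up time"]
trigrams = ["high cpu due", "power supply failure", "backup timer expired"]

print(drop_covered_bigrams(bigrams, trigrams))
# ['nexus 7000', 'up time']
```

Collecting the word pairs into a set first also means each bigram is checked with a single lookup instead of being scanned against every trigram, which may help if the files are large.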
