Reputation: 243
I have generated bigrams and trigrams in separate files.
The bigrams look like this:
high cpu
power supply
nexus 7000
..
The trigrams look like this:
high cpu due
power supply failure
..
For some phrases only bigrams are generated, and a trigram may not add much meaning. But for phrases like "high cpu due", the trigram carries much more meaning than the bigram.
So I want to eliminate the bigrams that already appear inside a trigram and keep only the bigrams that do not appear in any trigram. I tried the code below. It finds the bigrams that are present in trigrams, but when there is no match it does not give the bigram back.
terms=['ios zone','ios zone firewall']
phrases = [
z for z in terms if z not in [x for x in terms for y in terms if x in y and x != y]
]
print(phrases)
This returns ['ios', 'zone', 'firewall'],
but if there is no match it should return the bigrams.
Upvotes: 0
Views: 1517
Reputation: 2069
To add to @pault's answer:
When you run a finder, you get the trigrams/bigrams as a list of tuples of strings.
For @pault's technique to work, you have to join each tuple into a single string, like this:
bigrams = finder.nbest(bigram_measures.pmi, 200)
trigrams = tfinder.nbest(trigram_measures.pmi, 200)
trigrams = [" ".join(t) for t in trigrams]
bigrams = [" ".join(b) for b in bigrams]
And finally, @pault's answer:
bigrams = [b for b in bigrams if all(b not in t for t in trigrams)]
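As an alternative to joining first, the filter can be applied to the token tuples directly. The following is only a sketch: the sample tuples are hypothetical stand-ins shaped like what nbest() returns, and the names bigram_tuples, trigram_tuples, covered and kept are made up for illustration.

```python
# Hypothetical sample data shaped like the output of finder.nbest():
# each n-gram is a tuple of tokens.
bigram_tuples = [("high", "cpu"), ("power", "supply"), ("nexus", "7000")]
trigram_tuples = [("high", "cpu", "due"), ("power", "supply", "failure")]

# Collect every adjacent word pair that occurs inside a trigram.
covered = set()
for t in trigram_tuples:
    covered.add(t[:2])  # first and second word
    covered.add(t[1:])  # second and third word

# Keep only the bigrams that no trigram contains, then join for display.
kept = [" ".join(b) for b in bigram_tuples if b not in covered]
print(kept)  # ['nexus 7000']
```

Comparing tuples matches whole words only, and the set lookup avoids scanning every trigram for every bigram.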
Upvotes: 0
Reputation: 43504
IIUC, you want to keep only bigrams that are not contained in any of the trigrams. One approach is to check for substring matches:
bigrams = [
"high cpu",
"power supply",
"nexus 7000"
]
trigrams = [
"high cpu due",
"power supply failure"
]
new_bigrams = [b for b in bigrams if all(b not in t for t in trigrams)]
print(new_bigrams)
#['nexus 7000']
We build new_bigrams using a list comprehension that keeps a bigram only if it is not contained in any of the trigrams. all(b not in t for t in trigrams) returns False if the bigram is a substring of any of the trigrams.
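One caveat with a plain substring test is that it can also match inside a word. If that matters for your data, a whole-word variant might look like the sketch below. It assumes the tokens in each phrase are separated by single spaces; contains_phrase and the sample lists are hypothetical names and data used only for illustration.

```python
def contains_phrase(phrase, text):
    """Return True if phrase occurs in text as a run of whole words.

    Assumes words are separated by single spaces.
    """
    return f" {phrase} " in f" {text} "

# Hypothetical sample data: "art gallery" is a substring of
# "smart gallery opening" but not a whole-word match.
sample_bigrams = ["high cpu", "art gallery"]
sample_trigrams = ["high cpu due", "smart gallery opening"]

# Plain substring check: drops "art gallery" because of "smart gallery".
substring_kept = [
    b for b in sample_bigrams if all(b not in t for t in sample_trigrams)
]

# Whole-word check: keeps "art gallery".
word_kept = [
    b for b in sample_bigrams
    if not any(contains_phrase(b, t) for t in sample_trigrams)
]

print(substring_kept)  # []
print(word_kept)       # ['art gallery']
```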
Upvotes: 2