Reputation: 183
I have a set of words:
{'adalah',
'akan',
'akhir',
'algoritme',
'alur',
'antar',
'antisense',
'asam',
'atas',
'atau',
'bahwa',
'bakteriofag',
'baru',
'basa',
'beranggota',
'berdasarkan',
'berikatan',
'berupa',
'pada',...}
I want to find which rows of my bigramPMITable DataFrame have a bigram that contains at least one word from the set:
             bigram        PMI
0     (itu, adalah)  11.487338
1       (DNA, pada)   6.386371
2      (pada, oleh)   6.386371
3      (pada, basa)   1.105795
4      (yang, satu)   1.105795
5      (gula, yang)   1.044394
6     (yang, tidak)   1.044394
7       (pada, DNA)   0.986496
8   (unting, dalam)   0.931790
9      (DNA, tidak)   0.925095
10   (DNA, menjadi)   0.925095
11   (dan, sebagai)   0.905196
12   (pada, unting)   0.834493
The expected output is:
(itu, adalah) 11.487338
(DNA, pada) 6.386371
(pada, oleh) 6.386371
(pada, basa) 1.105795
(pada, DNA) 0.986496
(pada, unting) 0.834493
These rows match because each bigram contains a word from the set, such as 'adalah' or 'pada'. How can I filter the DataFrame this way? Any help is much appreciated.
Upvotes: 1
Views: 496
Reputation: 862661
First solution: use the set s with isdisjoint, and filter by boolean indexing with the mask inverted by ~:
df1 = df[~df.bigram.map(s.isdisjoint)]
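To see why the inverted mask works, here is a minimal self-contained sketch. It assumes the DataFrame is rebuilt by hand from the rows shown in the question, and it uses only a few words from the set, since the full set is elided there.

```python
import pandas as pd

# Assumption: a small subset of the word set from the question
s = {'adalah', 'akan', 'basa', 'pada'}

# Assumption: the frame is rebuilt by hand from the rows in the question
df = pd.DataFrame({
    'bigram': [('itu', 'adalah'), ('DNA', 'pada'), ('pada', 'oleh'),
               ('pada', 'basa'), ('yang', 'satu'), ('gula', 'yang'),
               ('yang', 'tidak'), ('pada', 'DNA'), ('unting', 'dalam'),
               ('DNA', 'tidak'), ('DNA', 'menjadi'), ('dan', 'sebagai'),
               ('pada', 'unting')],
    'PMI': [11.487338, 6.386371, 6.386371, 1.105795, 1.105795, 1.044394,
            1.044394, 0.986496, 0.931790, 0.925095, 0.925095, 0.905196,
            0.834493],
})

# s.isdisjoint(t) is True when the tuple t shares no word with s
no_match = df['bigram'].map(s.isdisjoint)

# Invert the mask to keep rows that share at least one word with s
df1 = df[~no_match]
print(df1)
```

The mask is True for the rows to drop, which is why it has to be inverted with ~ before indexing.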
Or you can create a helper DataFrame and use isin:
df1 = df[pd.DataFrame(df['bigram'].tolist(), index=df.index).isin(s).any(axis=1)]
print (df1)
            bigram        PMI
0    (itu, adalah)  11.487338
1      (DNA, pada)   6.386371
2     (pada, oleh)   6.386371
3     (pada, basa)   1.105795
7      (pada, DNA)   0.986496
12  (pada, unting)   0.834493
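The helper-DataFrame variant may be easier to follow when split into steps. This is a sketch on a three-row sample with a reduced word set; the names words and hits are only illustrative.

```python
import pandas as pd

# Assumption: reduced word set and a three-row sample from the question
s = {'adalah', 'basa', 'pada'}
df = pd.DataFrame({
    'bigram': [('itu', 'adalah'), ('yang', 'satu'), ('pada', 'DNA')],
    'PMI': [11.487338, 1.105795, 0.986496],
})

# One column per position in the bigram, aligned to df's index
words = pd.DataFrame(df['bigram'].tolist(), index=df.index)

# True where the word in that position is a member of s
hits = words.isin(s)

# Keep rows where either position matched
df1 = df[hits.any(axis=1)]
print(df1)
```

Passing index=df.index matters when df does not have a default index, because the boolean mask must align with df.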
Setup:
s = {'adalah',
'akan',
'akhir',
'algoritme',
'alur',
'antar',
'antisense',
'asam',
'atas',
'atau',
'bahwa',
'bakteriofag',
'baru',
'basa',
'beranggota',
'berdasarkan',
'berikatan',
'berupa',
'pada'}
Performance:
df = pd.concat([df] * 10000, ignore_index=True)
In [41]: %timeit df[~df.bigram.map(s.isdisjoint)]
21 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %timeit df[pd.DataFrame(df['bigram'].tolist(), index=df.index).isin(s).any(axis=1)]
41.6 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# yatu's solutions
In [43]: %timeit df[df.bigram.map(s.intersection).ne(set())]
73.4 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [44]: %timeit df[df.bigram.map(s.intersection).str.len().gt(0)]
127 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
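Outside IPython, a comparable measurement can be sketched with the standard timeit module. The sample data and the function names below are assumptions for illustration, and the timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Assumption: reduced word set and a three-row sample, repeated to enlarge the frame
s = {'adalah', 'pada'}
base = pd.DataFrame({
    'bigram': [('itu', 'adalah'), ('yang', 'satu'), ('pada', 'DNA')],
    'PMI': [11.487338, 1.105795, 0.986496],
})
df = pd.concat([base] * 10000, ignore_index=True)

def by_isdisjoint():
    # Rows whose bigram is not disjoint from s
    return df[~df['bigram'].map(s.isdisjoint)]

def by_isin():
    # Rows where any position of the bigram is in s
    words = pd.DataFrame(df['bigram'].tolist(), index=df.index)
    return df[words.isin(s).any(axis=1)]

# Check that both candidates select the same rows before timing them
same_rows = by_isdisjoint().equals(by_isin())

for func in (by_isdisjoint, by_isin):
    seconds = timeit.timeit(func, number=10) / 10
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```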
Upvotes: 2
Reputation: 88236
Here's one approach using sets (somewhat slower than jezrael's set.isdisjoint approach):
df[df.bigram.map(s.intersection).ne(set())]
            bigram        PMI
0    (itu, adalah)  11.487338
1      (DNA, pada)   6.386371
2     (pada, oleh)   6.386371
3     (pada, basa)   1.105795
7      (pada, DNA)   0.986496
12  (pada, unting)   0.834493
Where:
s = {'adalah',
'akan',
'akhir',
'algoritme',
'alur',
'antar',
'antisense',
'asam',
'atas',
'atau',
'bahwa',
'bakteriofag',
'baru',
'basa',
'beranggota',
'berdasarkan',
'berikatan',
'berupa',
'pada'}
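The intermediate result of the intersection approach can be inspected as follows. This is a sketch on a three-row sample with a reduced word set; the name common is only illustrative.

```python
import pandas as pd

# Assumption: reduced word set and a three-row sample from the question
s = {'adalah', 'basa', 'pada'}
df = pd.DataFrame({
    'bigram': [('itu', 'adalah'), ('yang', 'satu'), ('pada', 'basa')],
    'PMI': [11.487338, 1.105795, 1.105795],
})

# For each bigram, the set of words it shares with s (empty if none)
common = df['bigram'].map(s.intersection)

# Keep rows whose shared-word set is not empty
df1 = df[common.ne(set())]
print(df1)
```

Unlike isdisjoint, this builds a new set for every row, which is a plausible reason for the slower timing reported above. It is still useful when the matched words themselves are needed.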
Upvotes: 1