ohai
ohai

Reputation: 183

Finding common words in Pandas dataframe

I have set of words

{'adalah',
 'akan',
 'akhir',
 'algoritme',
 'alur',
 'antar',
 'antisense',
 'asam',
 'atas',
 'atau',
 'bahwa',
 'bakteriofag',
 'baru',
 'basa',
 'beranggota',
 'berdasarkan',
 'berikatan',
 'berupa',
 'pada',...}

I tried to find whether the word in the set contained in the bigramPMITable dataframe that I had

    bigram         PMI
0  (itu, adalah)   11.487338
1  (DNA, pada)     6.386371
2  (pada, oleh)    6.386371
3  (pada, basa)    1.105795
4  (yang, satu)    1.105795
5  (gula, yang)    1.044394
6  (yang, tidak)   1.044394 
7  (pada, DNA)     0.986496
8  (unting, dalam) 0.931790
9  (DNA, tidak)    0.925095
10 (DNA, menjadi)  0.925095
11 (dan, sebagai)  0.905196
12 (pada, unting)  0.834493

If so, then the expected output will be like this:

(itu, adalah) 11.487338
(DNA, pada) 6.386371
(pada, oleh) 6.386371
(pada, basa) 1.105795
(pada, DNA) 0.986496
(pada, unting) 0.834493

They found the word 'adalah' and 'pada' at the bigramPMITable dataframe. How do I find?. Can anyone can help? Thanks. Any help is much appreciated.

Upvotes: 1

Views: 496

Answers (2)

jezrael
jezrael

Reputation: 862661

First solution with sets and isdisjoint and filter by boolean indexing with inverted mask by ~:

df1 = df[~df.bigram.map(s.isdisjoint)]

Or you can create helper DataFrame with isin:

df1 = df[pd.DataFrame(df['bigram'].tolist(), index=df.index).isin(s).any(axis=1)]

print (df1)
            bigram        PMI
0    (itu, adalah)  11.487338
1      (DNA, pada)   6.386371
2     (pada, oleh)   6.386371
3     (pada, basa)   1.105795
7      (pada, DNA)   0.986496
12  (pada, unting)   0.834493

Setup:

s = {'adalah',
 'akan',
 'akhir',
 'algoritme',
 'alur',
 'antar',
 'antisense',
 'asam',
 'atas',
 'atau',
 'bahwa',
 'bakteriofag',
 'baru',
 'basa',
 'beranggota',
 'berdasarkan',
 'berikatan',
 'berupa',
 'pada'}

Performance:

df = pd.concat([df] * 10000, ignore_index=True)

In [41]: %timeit df[~df.bigram.map(s.isdisjoint)] 
21 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
In [42]: %timeit df[pd.DataFrame(df['bigram'].tolist(), index=df.index).isin(s).any(axis=1)] 
41.6 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#yatu solutions
In [43]: %timeit df[df.bigram.map(s.intersection).ne(set())] 
73.4 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
In [44]: %timeit df[df.bigram.map(s.intersection).str.len().gt(0)] 
127 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 

Upvotes: 2

yatu
yatu

Reputation: 88236

Here's one approach using sets (somewhat slower than jezrael's set.isdisjoint approach):

df[df.bigram.map(s.intersection).ne(set())]

         bigram        PMI
0    (itu, adalah)  11.487338
1      (DNA, pada)   6.386371
2     (pada, oleh)   6.386371
3     (pada, basa)   1.105795
7      (pada, DNA)   0.986496
12  (pada, unting)   0.834493

Where:

s = {'adalah',
 'akan',
 'akhir',
 'algoritme',
 'alur',
 'antar',
 'antisense',
 'asam',
 'atas',
 'atau',
 'bahwa',
 'bakteriofag',
 'baru',
 'basa',
 'beranggota',
 'berdasarkan',
 'berikatan',
 'berupa',
 'pada'}

Upvotes: 1

Related Questions