Protima Rani Paul
Protima Rani Paul

Reputation: 210

Multi String search in Data frame in Multiple column with AND or OR Option

I can do a single word search in each column but unable to search user-provided number of string search with "and" "or" option

   0     1    3   4     
0 [OH-] [Na+] NAN CCO 
                                              
1 [OH-] [Na+] CCO  Cl  

This one works

search = 'CCO' 
df.loc[df.isin([search]).any(axis=1)].index.tolist()

                                      

For multi-search I tried

import re

terms = ['C1', 'CCO']
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['col'].str.contains(p)]

Gives me KeyError: 'col'

Expected output

Search='C1' AND '[NA+]

Results 1

Search='CCO' OR 'C1'

Results 0 1

Upvotes: 2

Views: 111

Answers (2)

Mark
Mark

Reputation: 4455

I created your dataframe this way:

df = pd.DataFrame( { 0 : ["[OH-]","[Na+]","NAN","CCO" ], 1 :  ["[OH-]","[Na+]","CCO","Cl"] } ).transpose()

Yielding this df:

       0      1    2    3
0  [OH-]  [Na+]  NAN  CCO
1  [OH-]  [Na+]  CCO   Cl

I observed that you can do your OR logic with the isin() function on the df:

df.isin(['CCO','C1'])

Yields:

       0      1      2      3
0  False  False  False   True
1  False  False   True  False

And so you can figure out which rows match using any(1) as you are using:

df.isin(['CCO','C1']).any(1).index.tolist()

Yields:

[0, 1]

Logic for AND:

The snippet below looks for each term individually and accumulates them in the results dataframe. After finding matching columns, the number of matches in each row is checked to see if it matches the number of terms.

results = pd.DataFrame()
terms = [ 'Cl', '[Na+]' ]  
for term in terms:
      if results.empty:
        results = df.isin( [ term ] ) 
      else:
        results |= df.isin( [ term ] ) 

results['count'] = results.sum(axis=1)
print( results[ results['count'] == len( terms )  ].index.tolist() ) 

I know there is a better way - but this way works (I think)

The above code yields [1] for terms = [ 'Cl', '[Na+]' ] and [0,1] for terms = [ "[OH-]","[Na+]" ] .

Upvotes: 1

Code Different
Code Different

Reputation: 93161

Because there isn't a column name col. Try this:

df[df.apply(lambda col: col.str.contains(p)).any(axis=1)]

col is now the name of an input parameter to the lambda.

Upvotes: 1

Related Questions