The Great

Reputation: 7733

How to use a mix of regex and exact literal matching to fetch index values

I have a dataframe that can be created with the code below:

 import pandas as pd

 df2 = pd.DataFrame({'level_0': ['No case notes','Notes','1.Chinese','2.Widowed','No']})

It has a single column, level_0, with five rows at index 0 to 4.

I also have an input list:

input_terms = ['No','Widowed','Chinese']

I would like to search for these terms in the dataframe and get the index of each match.

The output I want is:

[4,3,2]  # the output index list from the dataframe for my input terms

As you can see, I don't want the result set to include 'No case notes' or 'Notes', even though they contain 'No' as part of the string. Here I need an exact match.

But for the input terms 'Chinese' and 'Widowed', I want the result set to include '1.Chinese' and '2.Widowed'. Here I need something like the str.contains method.

How can I apply a mix of exact matching and a regex/str.contains approach to search for a string?

Upvotes: 1

Views: 139

Answers (2)

jezrael

Reputation: 863236

If order of index values is not important:

df2= pd.DataFrame({'level_0': ['No case notes','notes','1.Chinese','2.Widowed','No']})

input_terms = ['No','Widowed','Chinese']

pat = '|'.join(r"\d+\.{}$".format(x) for x in input_terms)
m1 = df2['level_0'].str.contains(pat)
m2 = df2['level_0'].isin(input_terms)

idx = df2.index[m1 | m2]
print (idx)
Int64Index([2, 3, 4], dtype='int64')

If order is important:

input_terms = ['No','Widowed','Chinese']

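out = []
for x in input_terms:
    # rows whose value equals the term exactly
    a = df2.index[df2['level_0'] == x]
    # rows of the form "<digits>.<term>"
    b = df2.index[df2['level_0'].str.contains(r'\d+\.{}$'.format(x))]
    # append this term's matches so the result follows the input order
    out.extend(a.union(b).tolist())
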
print (out)
[4, 3, 2]
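
The patterns above interpolate each term directly into the regex, so a term containing regex metacharacters (for example '+' or '.') would be read as regex syntax instead of literal text. A minimal sketch of one way to guard against that, assuming a hypothetical helper named find_term_indices, an extra '3.C++' row that is not in the question's data, and a leading '^' anchor that the original pattern does not have:

```python
import re

import pandas as pd


def find_term_indices(series, terms):
    """Hypothetical helper: index labels per term, in the order of `terms`."""
    found = []
    for term in terms:
        # Exact literal match, e.g. 'No' == 'No'.
        exact = series == term
        # "<digits>.<term>" match; re.escape keeps the term literal.
        # The '^' anchor is an assumption added for this sketch.
        numbered = series.str.contains(
            r'^\d+\.' + re.escape(term) + r'$', regex=True
        )
        found.extend(series.index[exact | numbered].tolist())
    return found


# The question's values plus one illustrative row with metacharacters.
s = pd.Series(['No case notes', 'Notes', '1.Chinese', '2.Widowed', 'No', '3.C++'])
indices = find_term_indices(s, ['No', 'Widowed', 'Chinese', 'C++'])
print(indices)  # [4, 3, 2, 5]
```

Within a single term the matches come back in index order; across terms they follow the order of the input list.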

Upvotes: 2

Sweeper

Reputation: 272895

Try this regex:

^[^a-zA-Z]*XXX[^a-zA-Z]*$

Replace XXX with the search terms (remember to escape them!). For example:

^[^a-zA-Z]*(?:Chinese|No|Widowed)[^a-zA-Z]*$

This is kind of a mix between str.contains and exact matches. It will basically ignore certain characters (in this case, everything that is not a-zA-Z), and do an exact match. If you want to ignore a different set of characters, just change the two character classes at the two ends. For example, if you want to ignore spaces as well:

^[^a-zA-Z\s]*XXX[^a-zA-Z\s]*$
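
To connect this regex back to the dataframe in the question, here is a rough sketch of how it might be applied with pandas. The helper name padded_exact_pattern is hypothetical, and looping per term (rather than using one combined alternation) is a choice made here so that the result follows the order of input_terms:

```python
import re

import pandas as pd

df2 = pd.DataFrame(
    {'level_0': ['No case notes', 'Notes', '1.Chinese', '2.Widowed', 'No']}
)
input_terms = ['No', 'Widowed', 'Chinese']


def padded_exact_pattern(term):
    """Hypothetical helper: the term, surrounded only by non-letters."""
    return r'^[^a-zA-Z]*{}[^a-zA-Z]*$'.format(re.escape(term))


result = []
for term in input_terms:
    # The pattern is anchored at both ends, so contains() acts as a full match.
    mask = df2['level_0'].str.contains(padded_exact_pattern(term), regex=True)
    result.extend(df2.index[mask].tolist())

print(result)  # [4, 3, 2]
```

'No case notes' and 'Notes' are rejected because letters follow 'No', while '1.Chinese' and '2.Widowed' pass because only digits and a dot precede the term.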

Upvotes: 2
