Stephen Lee

Reputation: 35

Mining for Term that is "Included In" Entry Rather than "Equal To"

I am doing some data mining. I have a database that looks like this (pulling out three lines):

    100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
    1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
    115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$

My code looks like this:

    import pandas as pd

    # Each line is a '$'-delimited record; split it into a tuple of fields.
    with open('DRUG20Q4.txt') as fileDrug20Q4:
        drugTupleList20Q4 = [tuple(i.rstrip('\n').split('$')) for i in fileDrug20Q4]

    # Keep fields 0, 3, and 5 (primary ID, role, drug name).
    drug20Q4 = []
    for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
        drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

    drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns = ['PrimaryID', 'Role', 'Drug Name'])
    drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])

Currently the code pulls out only entries whose name is exactly "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL"? I can't simply list every variety, as there are too many to enumerate, so I need something that captures any entry containing the term "OLMESARTAN".

Thanks!

Upvotes: 0

Views: 31

Answers (1)

Kris

Reputation: 589

You can use pandas' Series.str.contains, which tests whether each string contains a given substring or regex, to get the rows you are looking for.

Here's an example (using some string I found in the documentation):

    import pandas as pd
    df = pd.DataFrame()
    item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
    df['test'] = item.split(' ')
    df[df['test'].str.contains('de')]

This outputs:

    test
4   Index
22  Index.
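
Applied to the data in the question, a minimal sketch might look like the following. The first three rows are taken from the sample records above; the ASPIRIN row is hypothetical and is there only to show a non-match. The options case=False, regex=False, and na=False are assumptions about what is wanted here (a case-insensitive literal match, with missing names treated as non-matches), so adjust them as needed.

```python
import pandas as pd

# PrimaryID, Role, and Drug Name taken from the question's three sample records.
drug_df = pd.DataFrame(
    [
        ("100324822", "PS", "OLMESARTAN MEDOXOMIL"),
        ("1014687010", "SS", "HYDROCHLOROTHIAZIDE\\OLMESARTAN MEDOXOMIL"),
        ("115700162", "C", "OLMESARTAN"),
        ("999999999", "PS", "ASPIRIN"),  # hypothetical row, added to show a non-match
    ],
    columns=["PrimaryID", "Role", "Drug Name"],
)

# Boolean mask: True wherever the name contains "OLMESARTAN" anywhere in it.
# regex=False treats the pattern as a literal substring rather than a regex.
mask = drug_df["Drug Name"].str.contains(
    "OLMESARTAN", case=False, regex=False, na=False
)

# Keep only the matching rows.
olmesartan_df = drug_df[mask]
print(olmesartan_df)
```

In the question's code this would replace the == 'OLMESARTAN' comparison inside .loc with the str.contains mask.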

Upvotes: 0
