M_Oxford
M_Oxford

Reputation: 361

Python: Searching through multiple columns and identifying rows that contain any of the elements in a list

I recently posted a question on how to subset using R: Subset dataframe in R based on a list specified in a vector (using a 'starts with' expression or equivalent)

User @Ric S very kindly provided the R answer to this question. The R solution is:

library(dplyr)
df %>% 
  filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .)))

However, my impression is that Python may be more time efficient. Just to summarize, I am trying to identify any participant prescribed statins in a dataset of over 1 million rows and subset based on this. I have a list that includes all the codes for these medications (I've just made a few up for demonstration purposes), and I would next like to search through the dataframe and subset it to anyone who has a medication code that "starts with" any of the elements on the list. The loop seems redundant in the example, but I will be running this code for a few drug classes.

Example dataset, called meds_df:

     ID readcode_1 readcode_2 generic_name
1  1001       bxd1 1146785342  Simvastatin
2  1002       <NA>       <NA>         <NA>
3  1003       <NA>       <NA>  Pravastatin
4  1004       <NA>       <NA>         <NA>
5  1005       bxd4   45432344         <NA>
6  1006       <NA>       <NA>         <NA>
7  1007       <NA>       <NA>         <NA>
8  1008       <NA>       <NA>         <NA>
9  1009       <NA>       <NA>         <NA>
10 1010       bxde       <NA>         <NA>
11 1011       <NA>       <NA>         <NA>

with the expected output being:

     ID readcode_1 readcode_2 generic_name
1  1001       bxd1 1146785342  Simvastatin
3  1003       <NA>       <NA>  Pravastatin
5  1005       bxd4   45432344         <NA>
10 1010       bxde       <NA>         <NA>

My code so far based on other stackoverflow questions, I'll also need to embed any() or an equivalent here:

list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]
variable_list = ['statins']
for m in variable_list:
    print('extracting individuals prescribed %s' %m)
test = meds_df.loc[meds_df['readcode_1', 'readcode_2','generic_name'].str.startswith(list_to_extract, na=False)]
    print(test)

Error recieved:

KeyError: ('readcode_1', 'readcode_2','generic_name')

Any help will be greatly appreciated!

Upvotes: 1

Views: 259

Answers (3)

DavideBrex
DavideBrex

Reputation: 2414

You can do it with apply in this way:

list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]

bool_df = df[['readcode_1', 'readcode_2','generic_name']].apply(lambda x: x.str.startswith(tuple(list_to_extract), na=False), axis=1)

df.loc[bool_df[bool_df.any(axis=1)].index]

Output:

    ID  readcode_1  readcode_2  generic_name
1   1001    bxd1    1.146785e+09    Simvastatin
3   1003    NaN     NaN             Pravastatin
5   1005    bxd4    4.543234e+07    NaN
10  1010    bxde    NaN             NaN

Thanks to r.ook for spotting a small mistake

Upvotes: 2

sammywemmy
sammywemmy

Reputation: 28699

An alternative solution, where the string processing occurs within vanilla python before recreating the dataframe :

list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]

cols_to_search = ['readcode_1', 'readcode_2','generic_name']

output = [(ID, *searchbox) 
          for ID, searchbox in zip(df.ID,df.filter(cols_to_search).to_numpy())
          if any([str(box).startswith(tuple(list_to_extract)) for box in searchbox])]

pd.DataFrame(output, columns = df.columns)


     ID readcode_1  readcode_2  generic_name
0   1001    bxd1     1.146785e+09   Simvastatin
1   1003    NaN      NaN            Pravastatin
2   1005    bxd4     4.543234e+07   NaN
3   1010    bxde     NaN            NaN

Upvotes: 1

r.ook
r.ook

Reputation: 13878

First of all, the proper syntax is meds_df[['readcode_1', 'readcode_2','generic_name']] (list of column names in a index slice). This is why you are getting a KeyError.

To answer your question, here is one way to accomplish it:

# Updated to use tuple per David's suggestion
idx = pd.concat((med_df[col].astype(str).str.startswith(tuple(list_to_extract)) for col in ['readcode_1', 'readcode_2','generic_name']), axis=1).any(axis=1)

med_df.loc[idx]

Result:

      ID readcode_1    readcode_2 generic_name
1   1001       bxd1  1.146785e+09  Simvastatin
3   1003        NaN           NaN  Pravastatin
5   1005       bxd4  4.543234e+07          NaN
10  1010       bxde           NaN          NaN

Upvotes: 2

Related Questions