Reputation: 361
I recently posted a question on how to subset using R: Subset dataframe in R based on a list specified in a vector (using a 'starts with' expression or equivalent)
User @Ric S very kindly provided the R answer to this question. The R solution is:
library(dplyr)
df %>%
filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .)))
However, my impression is that Python may be more time efficient. Just to summarize, I am trying to identify any participant prescribed statins in a dataset of over 1 million rows and subset based on this. I have a list that includes all the codes for these medications (I've just made a few up for demonstration purposes), and I would next like to search through the dataframe and subset it to anyone who has a medication code that "starts with" any of the elements on the list. The loop seems redundant in the example, but I will be running this code for a few drug classes.
Example dataset, called meds_df
:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1146785342 Simvastatin
2 1002 <NA> <NA> <NA>
3 1003 <NA> <NA> Pravastatin
4 1004 <NA> <NA> <NA>
5 1005 bxd4 45432344 <NA>
6 1006 <NA> <NA> <NA>
7 1007 <NA> <NA> <NA>
8 1008 <NA> <NA> <NA>
9 1009 <NA> <NA> <NA>
10 1010 bxde <NA> <NA>
11 1011 <NA> <NA> <NA>
with the expected output being:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1146785342 Simvastatin
3 1003 <NA> <NA> Pravastatin
5 1005 bxd4 45432344 <NA>
10 1010 bxde <NA> <NA>
My code so far based on other stackoverflow questions, I'll also need to embed any() or an equivalent here:
list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]
variable_list = ['statins']
for m in variable_list:
print('extracting individuals prescribed %s' %m)
test = meds_df.loc[meds_df['readcode_1', 'readcode_2','generic_name'].str.startswith(list_to_extract, na=False)]
print(test)
Error recieved:
KeyError: ('readcode_1', 'readcode_2','generic_name')
Any help will be greatly appreciated!
Upvotes: 1
Views: 259
Reputation: 2414
You can do it with apply in this way:
list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]
bool_df = df[['readcode_1', 'readcode_2','generic_name']].apply(lambda x: x.str.startswith(tuple(list_to_extract), na=False), axis=1)
df.loc[bool_df[bool_df.any(axis=1)].index]
Output:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1.146785e+09 Simvastatin
3 1003 NaN NaN Pravastatin
5 1005 bxd4 4.543234e+07 NaN
10 1010 bxde NaN NaN
Thanks to r.ook for spotting a small mistake
Upvotes: 2
Reputation: 28699
An alternative solution, where the string processing occurs within vanilla python before recreating the dataframe :
list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]
cols_to_search = ['readcode_1', 'readcode_2','generic_name']
output = [(ID, *searchbox)
for ID, searchbox in zip(df.ID,df.filter(cols_to_search).to_numpy())
if any([str(box).startswith(tuple(list_to_extract)) for box in searchbox])]
pd.DataFrame(output, columns = df.columns)
ID readcode_1 readcode_2 generic_name
0 1001 bxd1 1.146785e+09 Simvastatin
1 1003 NaN NaN Pravastatin
2 1005 bxd4 4.543234e+07 NaN
3 1010 bxde NaN NaN
Upvotes: 1
Reputation: 13878
First of all, the proper syntax is meds_df[['readcode_1', 'readcode_2','generic_name']]
(list
of column names in a index slice). This is why you are getting a KeyError
.
To answer your question, here is one way to accomplish it:
# Updated to use tuple per David's suggestion
idx = pd.concat((med_df[col].astype(str).str.startswith(tuple(list_to_extract)) for col in ['readcode_1', 'readcode_2','generic_name']), axis=1).any(axis=1)
med_df.loc[idx]
Result:
ID readcode_1 readcode_2 generic_name
1 1001 bxd1 1.146785e+09 Simvastatin
3 1003 NaN NaN Pravastatin
5 1005 bxd4 4.543234e+07 NaN
10 1010 bxde NaN NaN
Upvotes: 2