Reputation: 7693
I have a dataframe like as shown below
df = pd.DataFrame({'sub_id': [101,101,101,102,102,103,104],
'test_status':['Y','N','Yes','No','Not sure','NOT SURE','YES'],
'remarks':[np.nan,"testvalue",np.nan,"ilike","wild",np.nan,"test"],
'reg_value':['YES','NO','Y','N','SURE','YES','no'],
'TEST':['RES','YES','SAF','YWQTEU','JHLK','1265','QEEU']})
Basically, I would like to do the below
a) Get the names of the columns whichever has only values from the response
list given below
response = ['Y','y','YES','Yes','yes','NO','No','no','N','n','NOT SURE','Not sure','not sure','sure','SURE','Sure']
The column should not contain any other value except the ones mentioned in response list
.
For ex: column Test
should not be in ouptut because along with YES
, it contains other values which are not in response
list.
So, I tried the below but the output is incorrect. not sure how to exclude TEST
column.
df.apply(lambda row: row[row.isin(response)].index.tolist(), axis=1)
I expect my output to be like as shown below
op_col_list = ['test_status','reg_value']
So, I have two questions
a) How to exclude TEST
column for being considered?
b) How to make response
list elegant. As you can see the values are repeating due to uppper and lower case representation. Is there anyway to make a list case-insensitive
Upvotes: 0
Views: 715
Reputation: 35626
To handle the case insensitivity we can use str.fullmatch
with case=False
:
# These can be any case:
response = ['Y', 'YES', 'NO', 'N', 'NOT SURE', 'SURE']
# Create a Regex pattern:
pattern = f'{"|".join(response)}'
# Apply to DataFrame columns
# (selected using loc can choose only columns needed):
cols = (
df.loc[:, 'test_status':'reg_value']
.apply(lambda s: s.str.fullmatch(pattern, case=False).all())
)
# Filter With Results:
cols = cols.index[cols].tolist()
cols
before filter:
test_status True
remarks False
reg_value True
dtype: bool
cols
after filter and tolist
:
['test_status', 'reg_value']
*fullmatch
will check the cell contains only the value in the pattern. Can also use str.contains
if looking for cells which contain a response value.
Upvotes: 1
Reputation: 120399
Use isin
and keep only columns where all rows are True:
>>> df.columns[df.isin(response).all()].tolist()
['test_status', 'reg_value']
Modify response
to be more elegant:
resp = fr"({'|'.join(set([r.lower() for r in response]))})
>>> df.select_dtypes('object').columns[df.select_dtypes('object').stack()
.str.fullmatch(resp, case=False)
.unstack().all()].tolist()
['test_status', 'reg_value']
>>> resp
'(sure|not sure|yes|y|no|n)'
Upvotes: 4
Reputation: 782
This can help you if you want to make it case insensitive:
response_lower = ['yes', 'not sure', 'n', 'y', 'no', 'sure']
searched_features = []
for feature in df:
if df[feature].map(str).str.lower().isin(response_lower).mean() == 1:
searched_features.append(feature)
searched_features
Upvotes: 1