The Great

Reputation: 7693

Get column names where the column values match given criteria using pandas?

I have a dataframe like the one shown below

import numpy as np
import pandas as pd

df = pd.DataFrame({'sub_id': [101,101,101,102,102,103,104],
                   'test_status':['Y','N','Yes','No','Not sure','NOT SURE','YES'],
                   'remarks':[np.nan,"testvalue",np.nan,"ilike","wild",np.nan,"test"],
                   'reg_value':['YES','NO','Y','N','SURE','YES','no'],
                   'TEST':['RES','YES','SAF','YWQTEU','JHLK','1265','QEEU']})

Basically, I would like to do the below

a) Get the names of the columns that contain only values from the response list given below

response = ['Y','y','YES','Yes','yes','NO','No','no','N','n','NOT SURE','Not sure','not sure','sure','SURE','Sure']

The column should not contain any value other than the ones in the response list.

For example, the column TEST should not be in the output because, along with YES, it contains other values that are not in the response list.

So, I tried the below, but the output is incorrect and I am not sure how to exclude the TEST column.

df.apply(lambda row: row[row.isin(response)].index.tolist(), axis=1)

I expect my output to be like as shown below

op_col_list = ['test_status','reg_value']

So, I have two questions

a) How do I exclude the TEST column from being considered?

b) How do I make the response list more elegant? As you can see, the values repeat due to upper- and lower-case representations. Is there any way to make the list case-insensitive?

Upvotes: 0

Views: 715

Answers (3)

Henry Ecker

Reputation: 35626

To handle the case insensitivity we can use str.fullmatch with case=False:

# These can be any case:
response = ['Y', 'YES', 'NO', 'N', 'NOT SURE', 'SURE']

# Create a Regex pattern:
pattern = f'{"|".join(response)}'
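# (With the response list above, pattern is 'Y|YES|NO|N|NOT SURE|SURE'.)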

# Apply to the DataFrame columns
# (loc is used to select only the columns needed):
cols = (
    df.loc[:, 'test_status':'reg_value']
        .apply(lambda s: s.str.fullmatch(pattern, case=False).all())
)

# Filter With Results:
cols = cols.index[cols].tolist()

cols before filter:

test_status     True
remarks        False
reg_value       True
dtype: bool

cols after filter and tolist:

['test_status', 'reg_value']

*fullmatch checks that a cell consists only of a value from the pattern. str.contains can be used instead if you are looking for cells that merely contain a response value.
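As a small illustration of that difference (a hypothetical Series, not from the question; pandas imported as pd as in the question):

s = pd.Series(['YES', 'YES PLEASE', 'maybe'])

# fullmatch: the whole cell must match the pattern
s.str.fullmatch('YES|NO', case=False)   # True, False, False

# contains: a match anywhere in the cell is enough
s.str.contains('YES|NO', case=False)    # True, True, False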

Upvotes: 1

Corralien

Reputation: 120399

Use isin and keep only columns where all rows are True:

>>> df.columns[df.isin(response).all()].tolist()
['test_status', 'reg_value']
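
For reference, the intermediate mask for the df in the question looks like this (isin matches values exactly, which is why the response list needs every case variant):

>>> df.isin(response).all()
sub_id         False
test_status     True
remarks        False
reg_value       True
TEST           False
dtype: bool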

To make response more elegant, build a single case-insensitive pattern from the lowercased values:

>>> resp = fr"({'|'.join(set([r.lower() for r in response]))})"
>>> df.select_dtypes('object').columns[df.select_dtypes('object').stack()
                                         .str.fullmatch(resp, case=False)
                                         .unstack().all()].tolist()

['test_status', 'reg_value']


>>> resp
'(sure|not sure|yes|y|no|n)'

Upvotes: 4

GabrielP

Reputation: 782

This can help you if you want to make it case-insensitive:

response_lower = ['yes', 'not sure', 'n', 'y', 'no', 'sure']

searched_features = []
for feature in df:
    if df[feature].map(str).str.lower().isin(response_lower).mean() == 1:
        searched_features.append(feature)
searched_features
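
With the df from the question this gives:

['test_status', 'reg_value']

(.mean() == 1 plays the same role as .all() here: it is True only when every lowercased value is found in response_lower.)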

Upvotes: 1
