Reputation: 163
I have a pandas DataFrame with a large number of columns, and I need to find which columns are binary (containing only the values 0 and 1) without inspecting the data manually. Which function should I use?
Upvotes: 9
Views: 13920
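For concreteness, here is a minimal illustration of the goal (the data below is made up): given a frame like this, a binary-column check should report ['flag'].
import pandas as pd

df = pd.DataFrame({'flag': [1, 0, 1, 0],
                   'count': [1, 2, 3, 4],
                   'label': ['a', 'b', 'a', 'b']})
# expected result of a binary-column check: ['flag']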
Reputation: 1
Many years late, but here is my answer using nunique():
[col for col, val in df.nunique().items() if val == 2]
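Note that this returns every column with exactly two distinct values, whether or not those values are 0 and 1. A stricter variant (a sketch I am adding, not part of the original answer) combines the count check with a value check:
[col for col in df
 if df[col].nunique() == 2 and df[col].dropna().isin([0, 1]).all()]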
Upvotes: 0
Reputation: 341
I may be late, but note that one may also need to find columns with binary-valued features that are not already in 0/1 format, e.g. "Yes"/"No" or "True"/"False". The following handles that case:
binary_cols = [col for col in df.columns if len(df[col].unique()) == 2]
binary_cols
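A quick demonstration with made-up data (not from the original answer):
import pandas as pd

df = pd.DataFrame({'answer': ['Yes', 'No', 'Yes'],
                   'flag': [True, False, True],
                   'score': [1.0, 2.5, 3.0]})
binary_cols = [col for col in df.columns if len(df[col].unique()) == 2]
# binary_cols == ['answer', 'flag']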
Upvotes: 1
Reputation: 1
You can use the unique() method from pandas on each column in your dataset.
e.g. df["colname"].unique()
This returns an array of the unique values in the specified column.
You can also use a loop or a comprehension to traverse all the columns in the dataset.
e.g. [df[col].unique() for col in df]
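To answer the original question with this approach, you could compare each column's unique non-null values against {0, 1} (a sketch building on this answer, not part of it):
binary_cols = [col for col in df if set(df[col].dropna().unique()) == {0, 1}]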
Upvotes: 0
Reputation: 109626
To my knowledge, there is no direct function to test for this. Rather, you need to build something based on how the data was encoded (e.g. 1/0, T/F, True/False, etc.). In addition, if an integer column has a missing value, the entire column is cast to float.
In the example below, I test whether all unique non-null values are either 1 or 0. It returns a list of all such columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({'bool': [1, 0, 1, None],
                   'floats': [1.2, 3.1, 4.4, 5.5],
                   'ints': [1, 2, 3, 4],
                   'str': ['a', 'b', 'c', 'd']})

# Original version -- broken, because DataFrame objects have no unique() method:
# bool_cols = [col for col in df
#              if df[[col]].dropna().unique().isin([0, 1]).all().values]

# 2019-09-10 EDIT (per Hardik Gupta): operate on the Series instead.
bool_cols = [col for col in df
             if np.isin(df[col].dropna().unique(), [0, 1]).all()]
>>> bool_cols
['bool']
>>> df[bool_cols]
   bool
0   1.0
1   0.0
2   1.0
3   NaN
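As noted above, other encodings such as T/F or True/False exist; string-valued variants would not pass the 0/1 check. One possible extension (my own sketch; BINARY_SETS and binary_like_cols are names I made up, and the value sets are assumptions):
# Value sets to treat as binary encodings; extend as needed.
BINARY_SETS = [{0, 1}, {'T', 'F'}, {'True', 'False'}, {'Yes', 'No'}]

def binary_like_cols(frame):
    # A column qualifies if its non-null values exactly match one encoding.
    return [col for col in frame
            if any(set(frame[col].dropna().unique()) == s for s in BINARY_SETS)]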
Upvotes: 10
Reputation: 4790
Using Alexander's answer, with Python 3.6.6. Note that without dropna(), a column that contains NaN will not be reported, since np.isin does not match NaN against 0 or 1:
[col for col in df if np.isin(df[col].unique(), [0, 1]).all()]
Upvotes: 3
Reputation: 61
def is_binary(series, allow_na=False):
    if allow_na:
        # Reassign rather than dropna(inplace=True), which would mutate the caller's Series.
        series = series.dropna()
    return sorted(series.unique()) == [0, 1]
This is the most efficient solution I found; it is quicker than the answers above, and the timing difference becomes relevant on large data sets.
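A usage sketch (assuming the is_binary function above):
binary_cols = [col for col in df if is_binary(df[col], allow_na=True)]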
Upvotes: 6
Reputation: 7313
Improving upon @Aiden's answer, to avoid reporting an empty column:
[col for col in df
 if len(df[col].value_counts()) > 0
 and df[col].value_counts().index.isin([0, 1]).all()]
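To see why the emptiness check matters: for an all-NaN column, value_counts() is empty, and isin([0, 1]).all() over an empty index is vacuously True, so the plain check would wrongly report the column as binary. A quick illustration:
import pandas as pd

s = pd.Series([None, None])                # an all-NaN column
s.value_counts().index.isin([0, 1]).all()  # -> True, despite no 0/1 values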
Upvotes: 1
Reputation: 41
To expand on the answers above, using value_counts().index instead of unique() should do the trick (value_counts() already excludes NaN by default, so the dropna() is merely defensive):
bool_cols = [col for col in df if
             df[col].dropna().value_counts().index.isin([0, 1]).all()]
Upvotes: 4