na899

Reputation: 163

Which columns are binary in a Pandas DataFrame?

I have a pandas dataframe with a large number of columns and I need to find which columns are binary (with values 0 or 1 only) without looking at the data. Which function should be used?

Upvotes: 9

Views: 13920

Answers (8)

Dimitrosky

Reputation: 1

Many years late, but here is my answer using nunique():

[col for col,val in df.nunique().items() if val==2]
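For example (using a small made-up frame, column names are illustrative only), note that nunique() == 2 matches any two-valued column, not just 0/1 ones:

```python
import pandas as pd

# Hypothetical sample frame: 'flag' is 0/1, 'score' is two-valued but not 0/1
df = pd.DataFrame({'flag': [0, 1, 1, 0],
                   'score': [2, 3, 2, 3],
                   'value': [1.5, 2.5, 3.5, 4.5]})

# nunique() counts distinct values per column, so this also picks up
# two-valued columns whose values are not 0 and 1
two_valued = [col for col, val in df.nunique().items() if val == 2]
print(two_valued)  # ['flag', 'score']
```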

Upvotes: 0

Walker

Reputation: 341

I may be late, but one may also need to find columns with binary-occurring values that are not already in 0/1 format, e.g. "Yes"/"No" or "True"/"False". The following handles that:

binary_cols = [col for col in df.columns if len(df[col].unique()) == 2]
binary_cols

Upvotes: 1

Manish Baswal

Reputation: 1

You can just use the unique() function from pandas on each column in your dataset.

ex: df["colname"].unique()

This will return an array of the unique values in the specified column.

You can also iterate over all the columns in the dataset, for example with a list comprehension.

ex: [df[cols].unique() for cols in df]
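A quick sketch of this idea on a made-up frame (column names are assumptions), pairing each column name with its unique values so you can see at a glance which look binary:

```python
import pandas as pd

# Illustrative frame: 'a' is 0/1, 'b' is not
df = pd.DataFrame({'a': [0, 1, 0], 'b': ['x', 'y', 'x']})

# Map each column name to its unique values (as plain lists)
uniques = {col: df[col].unique().tolist() for col in df}
print(uniques)  # {'a': [0, 1], 'b': ['x', 'y']}
```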

Upvotes: 0

Alexander

Reputation: 109626

To my knowledge, there is no direct function to test for this. Rather, you need to build something based on how the data was encoded (e.g. 1/0, T/F, True/False, etc.). In addition, if your column has a missing value, the entire column will be encoded as a float instead of an int.

In the example below, I test whether all unique non-null values are either 1 or 0. It returns a list of all such columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({'bool': [1, 0, 1, None],
                   'floats': [1.2, 3.1, 4.4, 5.5],
                   'ints': [1, 2, 3, 4],
                   'str': ['a', 'b', 'c', 'd']})

# Original version (broken: a DataFrame has no .unique(), so this raises
# AttributeError)
# bool_cols = [col for col in df
#              if df[[col]].dropna().unique().isin([0, 1]).all().values]

# 2019-09-10 EDIT (per Hardik Gupta)
bool_cols = [col for col in df
             if np.isin(df[col].dropna().unique(), [0, 1]).all()]

>>> bool_cols
['bool']

>>> df[bool_cols]
   bool
0     1
1     0
2     1
3   NaN

Upvotes: 10

Hardik Gupta

Reputation: 4790

Building on Alexander's answer (tested with Python 3.6.6, with numpy imported as np):

[col for col in df if np.isin(df[col].unique(), [0, 1]).all()]

Upvotes: 3

lucas

Reputation: 61

def is_binary(series, allow_na=False):
    if allow_na:
        series = series.dropna()  # avoid mutating the caller's Series
    return sorted(series.unique()) == [0, 1]

This is the most efficient solution I found; it is quicker than the answers above, and on large data sets the timing difference becomes relevant.
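A usage sketch of the helper above (redefined here, with a non-mutating dropna, so the snippet runs standalone):

```python
import pandas as pd

def is_binary(series, allow_na=False):
    if allow_na:
        series = series.dropna()  # work on a copy without NaN
    return sorted(series.unique()) == [0, 1]

print(is_binary(pd.Series([0, 1, 1, 0])))                  # True
print(is_binary(pd.Series([0, 1, None])))                  # False: NaN counts as a value
print(is_binary(pd.Series([0, 1, None]), allow_na=True))   # True
```

Note that a constant column (all zeros or all ones) is rejected, since its sorted unique values are [0] or [1], not [0, 1].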

Upvotes: 6

sedeh

Reputation: 7313

Improving upon @Aiden to avoid returning an empty column:

[col for col in df if (len(df[col].value_counts()) > 0)
 and all(df[col].value_counts().index.isin([0, 1]))]

Upvotes: 1

Aiden

Reputation: 41

To expand on the answer just above, using value_counts().index instead of unique() should do the trick:

bool_cols = [col for col in df if
             df[col].dropna().value_counts().index.isin([0, 1]).all()]

Upvotes: 4
