curious

Reputation: 730

Using pandas, turn a string column into multiple True/False columns

I have this:

df = pd.DataFrame({'my_col' : ['red', 'red', 'green']})

my_col
red
red
green

I want this:

df2 = pd.DataFrame({'red' : [True, True, False], 'green' : [False, False, True]})

red  green
True  False
True  False
False   True

Is there an elegant way to do this?

Upvotes: 0

Views: 43

Answers (4)

Gonçalo Peres

Reputation: 13622

Considering that the original dataframe is df, one can use:

  1. pandas.get_dummies

  2. pandas.Series.str.get_dummies


Option 1

Using pandas.get_dummies, one can do the following

df2 = pd.get_dummies(df['my_col'], dtype=bool)

[Out]:

   green    red
0  False   True
1  False   True
2   True  False

If one wants the column red to appear first, a one-liner would look like the following

df2 = pd.get_dummies(df['my_col'], dtype=bool)[['red', 'green']]

[Out]:

     red  green
0   True  False
1   True  False
2  False   True

Option 2

Using pandas.Series.str.get_dummies, one can do the following

df2 = df['my_col'].str.get_dummies().astype(bool)

[Out]:

   green    red
0  False   True
1  False   True
2   True  False

If one wants the column red to appear first, a one-liner would look like the following

df2 = df['my_col'].str.get_dummies().astype(bool)[['red', 'green']]

[Out]:

     red  green
0   True  False
1   True  False
2  False   True
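If hard-coding ['red', 'green'] is not practical, the column order can be taken from the data instead. The following is only a minimal sketch, assuming the desired order is the order of first appearance in my_col; the variable name order is illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'my_col': ['red', 'red', 'green']})

# Distinct values in order of first appearance: red, then green
order = list(df['my_col'].unique())

# get_dummies returns the columns in sorted order here (green, red),
# so select them again in the order computed above
df2 = pd.get_dummies(df['my_col'], dtype=bool)[order]
print(df2)
```

The same selection should work with Option 2 by indexing the result of str.get_dummies().astype(bool) with order.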

Upvotes: 1

Olasimbo

Reputation: 1063

The pandas function get_dummies can do this.

import pandas as pd

df = pd.DataFrame({'my_col': ['red', 'red', 'green']})

# dtype=bool yields real booleans rather than 0/1 integers or 'True'/'False' strings
new_df = pd.get_dummies(df, dtype=bool)

# get_dummies prefixes each new column with the source column name; rename to drop it
new_df.rename(columns={'my_col_green': 'green', 'my_col_red': 'red'}, inplace=True)
print(new_df)
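The rename step can probably be avoided by asking get_dummies not to add a prefix in the first place. This is a minimal sketch, assuming a single string column as in the question; with several columns, empty prefixes could produce clashing column names.

```python
import pandas as pd

df = pd.DataFrame({'my_col': ['red', 'red', 'green']})

# An empty prefix and separator leave only the category value as the column name
new_df = pd.get_dummies(df, prefix='', prefix_sep='', dtype=bool)
print(new_df)
```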

Upvotes: 1

Naveed

Reputation: 11650

# reset index, to keep the row count
df = df.reset_index()

# create a cross tab keyed on (row index, value) so every original row is kept
(pd.crosstab(index=[df['index'], df['my_col']],
             columns=df['my_col'])
 .reset_index()                      # cleanup to match the output
 .drop(columns=['index', 'my_col'])  # drop unwanted columns
 .rename_axis(columns=None)          # remove axis name
 .astype(bool))                      # counts of 1/0 become True/False

   green    red
0  False   True
1  False   True
2   True  False

Upvotes: 0

T C Molenaar

Reputation: 3260

You can do this:

for color in df['my_col'].unique():
    df[color] = df['my_col'] == color

df2 = df[df['my_col'].unique()]

This loops over each color in my_col and adds a column to df named after that color, holding True/False for whether each row equals it. Finally, df2 is extracted from df by selecting only the color columns.

Another option is to start with an empty dataframe for df2 and immediately add the columns to this dataframe:

df2 = pd.DataFrame()
for color in df['my_col'].unique():
    df2[color] = df['my_col'] == color

Output:

     red  green
0   True  False
1   True  False
2  False   True
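The loop can also be folded into a dict comprehension, which builds df2 in one expression and leaves df untouched. This is just a sketch of the same idea, assuming the df from the question.

```python
import pandas as pd

df = pd.DataFrame({'my_col': ['red', 'red', 'green']})

# One boolean Series per distinct color, keyed by the color name;
# dict insertion order gives the columns in order of first appearance
df2 = pd.DataFrame({color: df['my_col'] == color
                    for color in df['my_col'].unique()})
print(df2)
```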

Upvotes: 1
