rgl
rgl

Reputation: 3

I have a dataset where I need to convert some columns into single categorical variables & concatenating multi true values

I have a dataset with unique ID with columns which have several bool values for each ID. Therefore, I need to convert these columns into a single categorical variables concatenating multiple true bool values. And if out of mentioned bool ID have 3 true values we assign to categories as "Win"

ID BoolCol_1 BoolCol_2 BoolCol_3 BoolCol_4 Other Col 1 Other Col 2
1 1 2 2 1 x Y
2 2 1 1 1 A b

1 -> True 2 -> False

ID are unique.

I am not able to think in my head how to solve this puzzle

Upvotes: 0

Views: 246

Answers (2)

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

Use the following approach:

bool_cols = ['BoolCol_1', 'BoolCol_2', 'BoolCol_3', 'BoolCol_4']
cnts = df[bool_cols].stack().groupby(level=0).value_counts().unstack()[1]
df['cat_col'] = pd.Series(np.where(cnts >= 3, 'W', 'L'), dtype='category')

Now cat_col is categorical column with fixed values W (win), L (lose)


In [229]: df
Out[229]: 
   ID  BoolCol_1  BoolCol_2  BoolCol_3  BoolCol_4 Other Col 1 Other Col 2 cat_col
0   1          1          2          2          1          x            Y       L
1   2          2          1          1          1          A            b       W

Upvotes: 0

Joon
Joon

Reputation: 91

Welcome to SO, rgl!

So in our case here, the numeric Boolean value for "True" is 1 and the numeric value for "False" is 2. The trick behind doing operations using Boolean values is to use addition and subtraction.

The first step is to add up all of the Boolean values contained in each row and append these values under a new column:

# Sum of Booleans in the row
df['sum_of_wins_and_losses'] = df.BoolCol_1 + df.BoolCol_2 + df.BoolCol_3 + df.BoolCol_4

enter image description here

The next step is to now write a simple function that uses if and else statements based on the logic you are looking for. You noted that there must be at least three True values in order for each ID to be considered a "Win". This is where you need to be a bit careful.

Here, the minimum sum is 4 if all four rows are "True" whereas the maximum sum is 8 if all four rows are "False". To be considered a "Win", each ID needs to have a maximum sum of 5 or less. A value of 5 basically means three wins and one loss (1+1+1+2=5).

# Write function that contains the logic
def assign_win_or_loss(row):
    if row <= 5:
        result = 'win'
    else:
        result = 'loss'
    return result

Now that we have defined the function, it is time to apply it to the dataframe and create a new column containing our categorical variables:

# Apply function and create a new column based on values in other column
df['win_or_loss'] = df['sum_of_wins_and_losses'].apply(assign_win_or_loss)

enter image description here

Upvotes: 0

Related Questions