lamlamboy
lamlamboy

Reputation: 13

Concatenate column names by using the binary values in the columns

Currently, I have a dataframe as follows:

date          A    B   C
02/19/2020    0    0   0 
02/20/2020    0    0   0
02/21/2020    1    1   1
02/22/2020    0    1   0
02/23/2020    0    1   1
02/24/2020    0    0   1
02/25/2020    1    0   1
02/26/2020    1    0   0

The binary columns contain integers. The "date" column is a DateTime object. I want to create a new categorical column that is based on the binary columns as follows

date          A    B   C   new
02/19/2020    0    0   0   "None"
02/20/2020    0    0   0   "None"
02/21/2020    1    1   1   A+B+C
02/22/2020    0    1   0   B
02/23/2020    0    1   1   B+C
02/24/2020    0    0   1   C
02/25/2020    1    0   1   A+C
02/26/2020    1    0   0   A

How can I achieve this?

Upvotes: 1

Views: 132

Answers (2)

Y04V
Y04V

Reputation: 19

You can iterate over the Dataframe to calculate the new columns values and then add it.

This is a basic example

new_column = []
for i, row in df.iterrows():
    row_val = None

    if row["A"]:
       if row_val:
          row_val += "+A"
       else:
          row_val = "A"

    if row["B"]:
       if row_val:
          row_val += "+B"
       else:
          row_val = "B"

    if row["C"]:
       if row_val:
          row_val += "+C"
       else:
          row_val = "C"
    
    if row_val is None:
       row_val = "None"
    
    new_column.append(row_val)

df["new_column_name"] = new_column

Upvotes: 0

jezrael
jezrael

Reputation: 863291

Use DataFrame.dot for matrix multiplication with columns names with omit first column by position in DataFrame.iloc, add separator to columns names without first and last remove separator by indexing str[:-1]:

df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1]
#set empty string to None
df.loc[df['new'].eq(''), 'new'] = None
print (df)
         date  A  B  C    new
0  02/19/2020  0  0  0   None
1  02/20/2020  0  0  0   None
2  02/21/2020  1  1  1  A+B+C
3  02/22/2020  0  1  0      B
4  02/23/2020  0  1  1    B+C
5  02/24/2020  0  0  1      C
6  02/25/2020  1  0  1    A+C
7  02/26/2020  1  0  0      A

If possible use NaNs instead Nones:

df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1].replace('', np.nan)

print (df)
         date  A  B  C    new
0  02/19/2020  0  0  0    NaN
1  02/20/2020  0  0  0    NaN
2  02/21/2020  1  1  1  A+B+C
3  02/22/2020  0  1  0      B
4  02/23/2020  0  1  1    B+C
5  02/24/2020  0  0  1      C
6  02/25/2020  1  0  1    A+C
7  02/26/2020  1  0  0      A

Or if possible set first column to DatetimeIndex use:

df1 = df.set_index('date')
df1['new'] = df1.dot(df1.columns + '+').str[:-1]
df1.loc[df1['new'].eq(''), 'new'] = None

Upvotes: 3

Related Questions