Anand Thirtha
Anand Thirtha

Reputation: 85

Extract column names & combine them with delimiter if value =1 (binary values) and put it in new column

I have a dataframe with binary values post performing get_dummies using pandas

df= 
Values  A1  A2  B1  B2  B3  B4  C1  C2  C3
10      1   0   1   0   0   0   1   0   0
12      0   1   0   0   1   0   0   1   0
3       0   1   0   1   0   0   0   0   1
5       1   0   0   0   0   1   1   0   0

I want a new column which has combination of all columns which has 1's in it

Expected output:

Values  A1  A2  B1  B2  B3  B4  C1  C2  C3  Combination
10      1   0   1   0   0   0   1   0   0   A1~~B1~~C1
12      0   1   0   0   1   0   0   1   0   A2~~B3~~C2
3       0   1   0   1   0   0   0   0   1   A2~~B2~~C3
5       1   0   0   0   0   1   1   0   0   A1~~B4~~C3

Actual matrix can be 25000+ rows*1000+columns

There is a similar solution found in R but i need it in Python bcoz all other dependencies are in python and R is new to me.

Extract column names with value 1 in binary matrix

Codes in R below, & need similar one or any other code in python which can help me to arrive at my expected output
Solution 1 : 
as.matrix(apply(m==1,1,function(a) paste0(colnames(m)[a], collapse = "")))

Solution 2: 
t <- which(m==1, arr.ind = TRUE)
as.matrix(aggregate(col~row, cbind(row=rownames(t), col=t[,2]), function(x) 
                                                    paste0(colnames(m)[x], collapse = "")))

how could something similar or arrive at my expected output in Python?

Upvotes: 5

Views: 478

Answers (3)

Mustafa Aydın
Mustafa Aydın

Reputation: 18315

df["Combination"] = df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]

We select columns except for Values with iloc and then form a dot product where the second operand is respective columns of df with ~~ added to the end. Result gives ~~ for the very end as well, so we chop it with .str[:-2]

to get

   Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1

Upvotes: 5

Alexandre B.
Alexandre B.

Reputation: 5500

You can try apply with str.join:

df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)

print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1

Explanations:

  1. For calculating the Combination, ignore Values column. Several approach are possible (see this topic). Here I use drop: df.drop("Values", axis=1).
  2. Apply a custom function over each row using apply and axis=1
  3. In the function, filter values different from 0 using x[x != 0]
  4. Select column name (here there are index of the Serie) using .index
  5. Use str.join to match desired output: "~~".join(x[x != 0].index)

Full illustration:

# Step 1
print(df.drop("Values", axis=1))
#    A1  A2  B1  B2  B3  B4  C1  C2  C3
# 0   1   0   1   0   0   0   1   0   0
# 1   0   1   0   0   1   0   0   1   0
# 2   0   1   0   1   0   0   0   0   1
# 3   1   0   0   0   0   1   1   0   0

# Step 3
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0], axis=1))
#     A1   A2   B1   B2   B3   B4   C1   C2   C3
# 0  1.0  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN
# 1  NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN
# 2  NaN  1.0  NaN  1.0  NaN  NaN  NaN  NaN  1.0
# 3  1.0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  NaN

# Step 4
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0].index, axis=1))
# 0    Index(['A1', 'B1', 'C1'], dtype='object')
# 1    Index(['A2', 'B3', 'C2'], dtype='object')
# 2    Index(['A2', 'B2', 'C3'], dtype='object')
# 3    Index(['A1', 'B4', 'C1'], dtype='object')

# Step 5
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1

Upvotes: 5

zanga
zanga

Reputation: 719

You can use the .dot function of pandas after dropping your values column to get the column names you want and then process the output with list comprehensions to obtain the right A1~~B1~~C1 format.

Here's what it looks looks like:

tmp = [list(i) for i in list(df.dot(df.columns))]
df['combination'] = ['~~'.join(i) for i in tmp]

print(df)

    A   B   C   D   combination
0   1   1   0   0   A~~B
1   0   1   1   0   B~~C
2   0   0   1   0   C

Upvotes: 3

Related Questions