Reputation: 85
I have a dataframe with binary values post performing get_dummies using pandas
df=
Values A1 A2 B1 B2 B3 B4 C1 C2 C3
10 1 0 1 0 0 0 1 0 0
12 0 1 0 0 1 0 0 1 0
3 0 1 0 1 0 0 0 0 1
5 1 0 0 0 0 1 1 0 0
I want a new column which has combination of all columns which has 1's in it
Expected output:
Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
5 1 0 0 0 0 1 1 0 0 A1~~B4~~C3
Actual matrix can be 25000+ rows*1000+columns
There is a similar solution found in R but i need it in Python bcoz all other dependencies are in python and R is new to me.
Extract column names with value 1 in binary matrix
Codes in R below, & need similar one or any other code in python which can help me to arrive at my expected output
Solution 1 :
as.matrix(apply(m==1,1,function(a) paste0(colnames(m)[a], collapse = "")))
Solution 2:
t <- which(m==1, arr.ind = TRUE)
as.matrix(aggregate(col~row, cbind(row=rownames(t), col=t[,2]), function(x)
paste0(colnames(m)[x], collapse = "")))
how could something similar or arrive at my expected output in Python?
Upvotes: 5
Views: 478
Reputation: 18315
df["Combination"] = df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]
We select columns except for Values
with iloc
and then form a dot product where the second operand is respective columns of df
with ~~
added to the end. Result gives ~~
for the very end as well, so we chop it with .str[:-2]
to get
Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
Upvotes: 5
Reputation: 5500
You can try apply
with str.join
:
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
# Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
# 0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
# 1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
# 2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
# 3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
Explanations:
Combination
, ignore Values
column. Several approach are possible (see this topic). Here I use drop
: df.drop("Values", axis=1)
.apply
and axis=1
0
using x[x != 0]
.index
str.join
to match desired output: "~~".join(x[x != 0].index)
Full illustration:
# Step 1
print(df.drop("Values", axis=1))
# A1 A2 B1 B2 B3 B4 C1 C2 C3
# 0 1 0 1 0 0 0 1 0 0
# 1 0 1 0 0 1 0 0 1 0
# 2 0 1 0 1 0 0 0 0 1
# 3 1 0 0 0 0 1 1 0 0
# Step 3
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0], axis=1))
# A1 A2 B1 B2 B3 B4 C1 C2 C3
# 0 1.0 NaN 1.0 NaN NaN NaN 1.0 NaN NaN
# 1 NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN
# 2 NaN 1.0 NaN 1.0 NaN NaN NaN NaN 1.0
# 3 1.0 NaN NaN NaN NaN 1.0 1.0 NaN NaN
# Step 4
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0].index, axis=1))
# 0 Index(['A1', 'B1', 'C1'], dtype='object')
# 1 Index(['A2', 'B3', 'C2'], dtype='object')
# 2 Index(['A2', 'B2', 'C3'], dtype='object')
# 3 Index(['A1', 'B4', 'C1'], dtype='object')
# Step 5
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
# Values A1 A2 B1 B2 B3 B4 C1 C2 C3 Combination
# 0 10 1 0 1 0 0 0 1 0 0 A1~~B1~~C1
# 1 12 0 1 0 0 1 0 0 1 0 A2~~B3~~C2
# 2 3 0 1 0 1 0 0 0 0 1 A2~~B2~~C3
# 3 5 1 0 0 0 0 1 1 0 0 A1~~B4~~C1
Upvotes: 5
Reputation: 719
You can use the .dot
function of pandas after dropping your values
column to get the column names you want and then process the output with list comprehensions to obtain the right A1~~B1~~C1
format.
Here's what it looks looks like:
tmp = [list(i) for i in list(df.dot(df.columns))]
df['combination'] = ['~~'.join(i) for i in tmp]
print(df)
A B C D combination
0 1 1 0 0 A~~B
1 0 1 1 0 B~~C
2 0 0 1 0 C
Upvotes: 3