Reputation: 337
In a pandas data frame there are multiple columns with binary values, and the challenge is to identify which columns carry one-hot labels (i.e. which columns together form a one-hot encoded vector) and which columns are independent binary features that are not part of any one-hot encoded vector.
The data that I need to clean and preprocess looks roughly like this:
Rows v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 Label
0 1 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0
2 0 1 0 1 0 0 0 1 0.5 0 0
3 0 0 0 0 0 1 0 0 0 1 0
4 0 0 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 1 0 0 0 1
6 0 0 0 1 0 0 0 0 0 1 1
7 0 0 1 0 1 0 0 0 0.2 0 0
8 0 0 0 0 0 1 0 0 0 1 0
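For reproducibility, the same frame can be built directly in pandas:
import pandas as pd

df = pd.DataFrame({
    'Rows':  [0, 1, 2, 3, 4, 5, 6, 7, 8],
    'v1':    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    'v2':    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    'v3':    [0, 0, 0, 0, 0, 0, 0, 1, 0],
    'v4':    [0, 0, 1, 0, 0, 0, 1, 0, 0],
    'v5':    [0, 0, 0, 0, 1, 0, 0, 1, 0],
    'v6':    [0, 0, 0, 1, 0, 0, 0, 0, 1],
    'v7':    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    'v8':    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    'v9':    [0, 0, 0.5, 0, 0, 0, 0, 0.2, 0],
    'v10':   [0, 0, 0, 1, 0, 0, 1, 0, 1],
    'Label': [0, 0, 0, 0, 1, 1, 1, 0, 0],
})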
Note: I need to find the specific combination of columns that has exactly one 1 and zeros elsewhere in every row, keeping in mind that there can also be independent binary columns that are not part of the one-hot encoding.
By that I mean a final combination of columns like the one below, where (after excluding the other binary columns) each row contains exactly one 1:
v1 v4 v5 v6 v7
1 0 0 0 0
0 0 0 0 1
0 1 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 1
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
Upvotes: 2
Views: 706
Reputation: 12417
What you want seems hard to achieve exactly, so I will only provide directions. You want the largest set of variables/factors that are mutually exclusive. You start by calculating the pairwise dot products of the binary columns (df is your data frame):
# keep only the candidate binary columns
df = df[df.columns[~df.columns.isin(['Rows', 'Label', 'v9'])]]
# pairwise dot products: entry (i, j) counts the rows where columns i and j are both 1
df.T.dot(df)
v1 v2 v3 v4 v5 v6 v7 v8 v10
v1 2 0 0 0 0 1 0 0 2
v2 0 2 0 1 0 0 0 1 0
v3 0 0 1 0 0 0 0 0 0
v4 0 1 0 2 0 0 0 1 1
v5 0 0 0 0 1 0 0 0 0
v6 1 0 0 0 0 1 0 0 1
v7 0 0 0 0 0 0 2 0 0
v8 0 1 0 1 0 0 0 1 0
v10 2 0 0 1 0 1 0 0 3
Now you want the largest symmetric sub-matrix whose off-diagonal entries are all 0. If you binarize the complement of the dot-product matrix above (converting zeros to 1 and non-zero entries to 0) and treat the result as the adjacency matrix of a graph, your problem translates into the maximum clique problem, which to the best of my knowledge is both fixed-parameter intractable and hard to approximate. However, if the number of variables is small, you can probably find it using brute force or an approximation algorithm.
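A rough sketch of that direction, assuming the networkx package is acceptable (df here is the reduced frame from the snippet above, and find_cliques simply enumerates maximal cliques):
import networkx as nx

gram = df.T.dot(df)                    # pairwise co-occurrence counts
adj = (gram == 0).astype(int)          # edge <=> two columns never share a 1 in any row
G = nx.from_pandas_adjacency(adj)      # graph whose nodes are the column names

# Maximum clique = largest set of mutually exclusive columns.
# Enumerating maximal cliques is exponential in the worst case, but fine for a handful of columns.
best = max(nx.find_cliques(G), key=len)
print(sorted(best))
A clique only guarantees that the columns never overlap; whether every row then has exactly one 1 in that subset can still be checked with df[best].sum(axis=1).eq(1).all().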
Upvotes: 3
Reputation: 21749
I think you can do that based on dtypes:
print(df.columns[df.dtypes != 'float'])
Index(['Rows', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v10', 'Label'], dtype='object')
You can also do it based on the number of unique values (take the columns with exactly 2 unique values):
df.columns[df.apply(pd.Series.nunique) == 2]
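A small sketch combining both checks (the binary_cols name is just illustrative; Label is dropped explicitly because it also passes both tests):
# integer-typed columns with exactly two distinct values
binary_cols = df.columns[(df.dtypes != 'float') & (df.nunique() == 2)]
binary_cols = binary_cols.drop(['Label'], errors='ignore')
print(binary_cols)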
Upvotes: 1