Reputation: 337
In a pandas data frame there are multiple columns with binary values, and the challenge is to identify which columns carry one-hot labels (i.e. which columns together form a one-hot encoded vector) and which columns are independent binary features that are not part of any one-hot encoded vector.
The data that I need to clean and preprocess looks roughly like this:
Rows v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 Label
0 1 1 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 0 0 0 0
2 0 1 0 1 0 0 0 1 0.5 0 0
3 0 0 0 0 0 1 0 0 0 1 0
4 0 0 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 1 0 0 0 1
6 0 0 0 1 0 0 0 0 0 1 1
7 0 0 1 0 1 0 0 0 0.2 0 0
8 0 0 0 0 0 1 0 0 0 1 0
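For reproducibility, the same frame can be built directly in pandas:
import pandas as pd

df = pd.DataFrame({
    'Rows':  [0, 1, 2, 3, 4, 5, 6, 7, 8],
    'v1':    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    'v2':    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    'v3':    [0, 0, 0, 0, 0, 0, 0, 1, 0],
    'v4':    [0, 0, 1, 0, 0, 0, 1, 0, 0],
    'v5':    [0, 0, 0, 0, 1, 0, 0, 1, 0],
    'v6':    [0, 0, 0, 1, 0, 0, 0, 0, 1],
    'v7':    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    'v8':    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    'v9':    [0, 0, 0.5, 0, 0, 0, 0, 0.2, 0],
    'v10':   [0, 0, 0, 1, 0, 0, 1, 0, 1],
    'Label': [0, 0, 0, 0, 1, 1, 1, 0, 0],
})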
Note: I need to find the specific combination of columns that has exactly one 1 and zeros elsewhere in every row, keeping in mind that there can also be independent binary columns that are not part of the one-hot encoding.
By that I mean a final combination of columns like the one below, where (after excluding the other binary columns) each row contains exactly one 1:
v1 v4 v5 v6 v7
1 0 0 0 0
0 0 0 0 1
0 1 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 1
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
Upvotes: 2
Views: 706
Reputation: 12417
What you want seems hard to achieve exactly, so I will only provide directions. You want the largest set of variables/factors that are mutually exclusive. You start by calculating the pairwise dot products of the binary columns (df is your data frame):
# keep only the candidate binary columns
df = df[df.columns[~df.columns.isin(['Rows', 'Label', 'v9'])]]
# pairwise dot products: entry (i, j) counts the rows where columns i and j are both 1
df.T.dot(df)
v1 v2 v3 v4 v5 v6 v7 v8 v10
v1 2 0 0 0 0 1 0 0 2
v2 0 2 0 1 0 0 0 1 0
v3 0 0 1 0 0 0 0 0 0
v4 0 1 0 2 0 0 0 1 1
v5 0 0 0 0 1 0 0 0 0
v6 1 0 0 0 0 1 0 0 1
v7 0 0 0 0 0 0 2 0 0
v8 0 1 0 1 0 0 0 1 0
v10 2 0 0 1 0 1 0 0 3
Now you want the largest symmetric sub-matrix whose off-diagonal entries are all 0. If you binarize the complement of the dot-product matrix above (converting zeros to 1 and non-zero entries to 0) and treat the result as the adjacency matrix of a graph, your problem translates into the maximum clique problem, which to the best of my knowledge is both fixed-parameter intractable and hard to approximate. However, if the number of variables is small, you can probably find it using brute force or an approximation algorithm.
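A rough sketch of that direction, assuming the networkx package is acceptable (df here is the reduced frame from the snippet above, and find_cliques simply enumerates maximal cliques):
import networkx as nx

gram = df.T.dot(df)                    # pairwise co-occurrence counts
adj = (gram == 0).astype(int)          # edge <=> two columns never share a 1 in any row
G = nx.from_pandas_adjacency(adj)      # graph whose nodes are the column names

# Maximum clique = largest set of mutually exclusive columns.
# Enumerating maximal cliques is exponential in the worst case, but fine for a handful of columns.
best = max(nx.find_cliques(G), key=len)
print(sorted(best))
A clique only guarantees that the columns never overlap; whether every row then has exactly one 1 in that subset can still be checked with df[best].sum(axis=1).eq(1).all().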
Upvotes: 3
Reputation: 21749
I think you can do that based on dtypes:
print(df.columns[df.dtypes != 'float'])
Index(['Rows', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v10', 'Label'], dtype='object')
You can also do it based on the number of unique values (take the columns with exactly 2 unique values):
df.columns[df.apply(pd.Series.nunique) == 2]
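A small sketch combining both checks (the binary_cols name is just illustrative; Label is dropped explicitly because it also passes both tests):
# integer-typed columns with exactly two distinct values
binary_cols = df.columns[(df.dtypes != 'float') & (df.nunique() == 2)]
binary_cols = binary_cols.drop(['Label'], errors='ignore')
print(binary_cols)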
Upvotes: 1