Reputation: 332
I have a pandas DataFrame:
df = pd.DataFrame([[1,0,0,1], [0,1,0,0], [0,0,0,0], [1,0,0,0]], columns=list("ABCD"))
>>> df
   A  B  C  D
0  1  0  0  1
1  0  1  0  0
2  0  0  0  0
3  1  0  0  0
I want to create a single-column DataFrame with the same number of rows as df, holding one label per row: every distinct combination of 1s and 0s in a row should be assigned a different class (preferably numeric). That is, the result should look like this:
>>> df_labels
   x
0  0
1  1
2  2
3  3
I am looking for a solution based on built-in functions from libraries such as pandas or sklearn rather than one coded from scratch, although any help is appreciated.
So far I have come up with this solution:
from sklearn.preprocessing import LabelEncoder
labels = []
for i in range(0, len(df)):
    # create a string from every row
    val = "".join([str(x) for x in df.loc[i]])
    labels.append(val)
# encode the row strings as numeric labels
enc = LabelEncoder()
enc.fit(labels)
df_labels = pd.DataFrame(enc.transform(labels))
>>> df_labels
   0
0  3
1  1
2  0
3  2
However, is there a better way to do it?
Upvotes: 0
Views: 76
Reputation: 25239
If you only need general label codes that separate the combinations of columns 'A', 'B', 'C', 'D' (not necessarily in the order of your desired output), using dot is a simple way:
n = np.arange(1, len(df.columns)+1)
Out[14]: array([1, 2, 3, 4])
df.dot(n)
Out[15]:
0    5
1    2
2    0
3    1
dtype: int64
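So each row is encoded as the dot product of its values with n. The four rows here get distinct values, but these weights do not guarantee that in general: [1, 1, 0, 0] and [0, 0, 1, 0] both give 3.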
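If the codes have to be distinct for every possible 0/1 row, one option is to use powers of two as weights, so that each row is read as a binary number. This is a minimal sketch, assuming the columns contain only 0 and 1; the names weights, linear, codes and clash are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]],
    columns=list("ABCD"),
)

# Weights 1, 2, 4, 8: each 0/1 row is interpreted as a binary number,
# so two different rows can never share a code.
weights = 2 ** np.arange(len(df.columns))
codes = df.dot(weights)

# Two rows that collide under the linear weights 1, 2, 3, 4
# (1*1 + 1*2 == 1*3) but stay distinct under the binary weights.
linear = np.arange(1, len(df.columns) + 1)
clash = pd.DataFrame([[1, 1, 0, 0], [0, 0, 1, 0]], columns=list("ABCD"))

print(codes.tolist())                # [9, 2, 0, 1]
print(clash.dot(linear).tolist())    # [3, 3]
print(clash.dot(weights).tolist())   # [3, 4]
```

The codes are not consecutive, and with many columns the weights grow quickly, so this is mainly suited to a modest number of binary columns.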
Upvotes: 1
Reputation: 323226
You can use pd.factorize on the rows converted to tuples:
pd.factorize(df.apply(tuple,1))[0]
array([0, 1, 2, 3])
pd.Series(pd.factorize(df.apply(tuple,1))[0])
0 0
1 1
2 2
3 3
dtype: int64
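To get the single-column DataFrame the question asks for, the codes can be wrapped back into a frame. The sketch below adds a fifth row that repeats row 1, purely as an illustration, to show that identical rows receive the same label; the column name x comes from the question, the extra row is an assumption.

```python
import pandas as pd

# Same data as in the question, plus one repeated row (illustrative only).
df = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]],
    columns=list("ABCD"),
)

# factorize numbers distinct values in order of first appearance.
codes, _ = pd.factorize(df.apply(tuple, axis=1))
df_labels = pd.DataFrame({"x": codes}, index=df.index)

print(df_labels["x"].tolist())  # [0, 1, 2, 3, 1]
```

Because labels follow the order of first appearance, this reproduces the 0, 1, 2, 3 ordering shown in the question's desired output.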
Upvotes: 1
Reputation: 503
As far as I know there isn't a built-in method, but you can do something like this:
df.apply(lambda x: '_'.join(x.astype(str)), axis=1).astype('category').cat.codes
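As a runnable sketch of this string-and-category idea on the question's data: each row is joined into a key such as "1_0_0_1", and the category codes of those keys become the labels. The names keys and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]],
    columns=list("ABCD"),
)

# Build one string key per row, e.g. "1_0_0_1".
keys = df.astype(str).agg("_".join, axis=1)

# Categories are sorted, so codes follow the lexicographic order of the keys.
labels = keys.astype("category").cat.codes

print(labels.tolist())  # [3, 1, 0, 2]
```

These are the same labels that the LabelEncoder loop in the question produces, since both approaches sort the row strings before numbering them.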
Upvotes: 0