Reputation: 2243
I want to go from this DataFrame, which is basically one-hot encoded:
In [2]: pd.DataFrame({"monkey":[0,1,0],"rabbit":[1,0,0],"fox":[0,0,1]})
Out[2]:
   fox  monkey  rabbit
0    0       0       1
1    0       1       0
2    1       0       0
to this one, which is 'reverse' one-hot encoded:
In [3]: pd.DataFrame({"animal":["monkey","rabbit","fox"]})
Out[3]:
animal
0 monkey
1 rabbit
2 fox
I imagine there's some sort of clever use of apply or zip to do this, but I'm not sure how. Can anyone help?
I've not had much success using indexing etc. to try to solve this problem.
Upvotes: 51
Views: 62303
Reputation: 210832
UPDATE: as Henry Ecker has already mentioned in his answer, as of Pandas 1.5.0 there is a native Pandas method for doing this - pandas.from_dummies()
Demo:
In [35]: s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])
In [36]: dummies = pd.get_dummies(s)
In [37]: dummies
Out[37]:
   bird  cat  dog  fox
0     0    0    1    0
1     0    1    0    0
2     0    0    1    0
3     1    0    0    0
4     0    0    0    1
5     0    0    1    0
In [38]: pd.from_dummies(dummies)
Out[38]:
0 dog
1 cat
2 dog
3 bird
4 fox
5 dog
NOTE: pd.from_dummies() fails with a ValueError if the dummies were created with the drop_first=True parameter, like pd.get_dummies(data, drop_first=True), and some row belongs to the dropped category (that row contains only zeros).
Demo:
In [39]: dummies = pd.get_dummies(s, drop_first=True)
In [40]: dummies
Out[40]:
   cat  dog  fox
0    0    1    0
1    1    0    0
2    0    1    0
3    0    0    0
4    0    0    1
5    0    1    0
In [41]: pd.from_dummies(dummies)
...
ValueError: Dummy DataFrame contains unassigned value(s); First instance in row: 3
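One way around this error, as a minimal sketch: from_dummies accepts a default_category argument that is assigned to rows with no column set. The sketch assumes you know which category drop_first removed (here 'bird', the first in sorted order).

```python
import pandas as pd

s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])

# drop_first=True removes the first category ('bird'), so row 3 is all zeros
dummies = pd.get_dummies(s, drop_first=True)

# default_category is used for rows that have no dummy column set
decoded = pd.from_dummies(dummies, default_category='bird')

# from_dummies returns a one-column DataFrame whose column label is ''
result = decoded[''].tolist()
print(result)
```

If the dropped category is not known, the information is simply gone and cannot be recovered from the dummies alone.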
OLD ANSWER: I think ayhan is right and it should be:
df.idxmax(axis=1)
For each row, this returns the label of the column that holds the row's maximum value. Since the data are 1s and 0s, that is the column containing the 1.
Demo:
In [40]: s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])
In [41]: s
Out[41]:
0 dog
1 cat
2 dog
3 bird
4 fox
5 dog
dtype: object
In [42]: pd.get_dummies(s)
Out[42]:
   bird  cat  dog  fox
0   0.0  0.0  1.0  0.0
1   0.0  1.0  0.0  0.0
2   0.0  0.0  1.0  0.0
3   1.0  0.0  0.0  0.0
4   0.0  0.0  0.0  1.0
5   0.0  0.0  1.0  0.0
In [43]: pd.get_dummies(s).idxmax(1)
Out[43]:
0 dog
1 cat
2 dog
3 bird
4 fox
5 dog
dtype: object
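One caveat with idxmax: it returns the first column holding the row maximum, so a row of all zeros is silently labelled with the first column. A minimal sketch of masking such rows, using a hypothetical four-row frame whose last row has no label set:

```python
import pandas as pd

df = pd.DataFrame({
    "monkey": [0, 1, 0, 0],
    "rabbit": [1, 0, 0, 0],
    "fox":    [0, 0, 1, 0],
})

# Plain idxmax labels the all-zero row 3 with the first column, 'monkey'
naive = df.idxmax(axis=1)

# Keep the label only where the row actually contains a 1; other rows become NaN
has_label = df.eq(1).any(axis=1)
animal = df.idxmax(axis=1).where(has_label)
print(animal)
```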
Upvotes: 87
Reputation: 35626
As of pandas 1.5.0, reversing one-hot encoding is supported directly with pandas.from_dummies:
import pandas as pd # v 1.5.0
onehot_df = pd.DataFrame({
"monkey": [0, 1, 0],
"rabbit": [1, 0, 0],
"fox": [0, 0, 1]
})
new_df = pd.from_dummies(onehot_df)
#
# 0 rabbit
# 1 monkey
# 2 fox
The resulting DataFrame appears to have no column header (it's an empty string). To fix this, rename the column after from_dummies:
new_df = pd.from_dummies(onehot_df).rename(columns={'': 'animal'})
# animal
# 0 rabbit
# 1 monkey
# 2 fox
Alternatively, if the DataFrame is already defined with separated columns (like the one-hot encoding produced by pandas.get_dummies), e.g.
import pandas as pd # v 1.5.0
onehot_df = pd.DataFrame({
'animal_fox': [0, 0, 1],
'animal_monkey': [0, 1, 0],
'animal_rabbit': [1, 0, 0]
})
#    animal_fox  animal_monkey  animal_rabbit
# 0           0              0              1
# 1           0              1              0
# 2           1              0              0
Simply specify the sep parameter to reverse the encoding:
new_df = pd.from_dummies(onehot_df, sep='_')
# animal
# 0 rabbit
# 1 monkey
# 2 fox
The string before the first instance of the sep delimiter will become the column header in the new DataFrame (in this case "animal") and the rest of the string will become the column values (in this case "rabbit", "monkey", "fox").
Upvotes: 4
Reputation: 79
A way to deal with multiple labels without a for loop. The result will be a column of list-like values. If you have the same number of labels in each row, you can add result_type='expand' to get several columns.
df.apply(lambda x: df.columns[x==1], axis=1)
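To make the multi-label case concrete, here is a small sketch on a hypothetical frame where rows carry different numbers of labels. The .tolist() call is an addition so that each cell holds a plain Python list.

```python
import pandas as pd

df = pd.DataFrame({
    "monkey": [1, 1, 0],
    "rabbit": [1, 0, 0],
    "fox":    [0, 1, 1],
})

# For each row, collect the labels of the columns equal to 1 as a plain list
labels = df.apply(lambda row: df.columns[row == 1].tolist(), axis=1)
print(labels)
```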
Upvotes: 0
Reputation: 63
It can be achieved with a simple apply on the DataFrame:
# function to get the name of the column whose value is 1 in a given row
def get_animal(row):
    return row.index[row == 1][0]
# add an animal column
df['animal'] = df.apply(get_animal, axis=1)
Upvotes: 0
Reputation: 349
You could try using melt(). This method also works when you have multiple OHE labels for a row.
# Your OHE dataframe
df = pd.DataFrame({"monkey":[0,1,0],"rabbit":[1,0,0],"fox":[0,0,1]})
mel = df.melt(var_name='animal', value_name='value') # Melting (a scalar var_name works across pandas versions)
mel[mel.value == 1].reset_index(drop=True) # this gives you the result
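Note that melt stacks the data column by column, so the filtered result is ordered by column rather than by original row. If row order matters, a possible variation (assuming pandas 1.1 or later, where melt has the ignore_index parameter) is to keep the original index and sort on it:

```python
import pandas as pd

df = pd.DataFrame({"monkey": [0, 1, 0], "rabbit": [1, 0, 0], "fox": [0, 0, 1]})

# ignore_index=False keeps each value's original row label in the melted index
mel = df.melt(var_name="animal", value_name="value", ignore_index=False)

# Keep the cells equal to 1, then restore the original row order
result = mel.loc[mel["value"] == 1, ["animal"]].sort_index()
print(result)
```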
Upvotes: 3
Reputation: 2083
This works with both single and multiple labels.
We can use advanced indexing to tackle this problem.
import pandas as pd
# columns are listed alphabetically so the output below is reproduced on any pandas version
df = pd.DataFrame({"cat": [0, 0, 0, 0, 1], "fox": [1, 0, 1, 0, 0],
                   "monkey": [1, 1, 0, 1, 0], "rabbit": [1, 1, 1, 1, 0]})
df['tags'] = ''  # create an empty string column
for col_name in df.columns.drop('tags'):
    # append the column name wherever that column equals 1
    df.loc[df[col_name] == 1, 'tags'] = df['tags'] + ' ' + col_name
print(df)
And the result is:
   cat  fox  monkey  rabbit                tags
0    0    1       1       1   fox monkey rabbit
1    0    0       1       1       monkey rabbit
2    0    1       0       1          fox rabbit
3    0    0       1       1       monkey rabbit
4    1    0       0       0                 cat
Explanation: We iterate over the label columns of the DataFrame.
df.loc[selection criteria, columns to write value] = value
df.loc[df[col_name] == 1, 'tags'] = df['tags'] + ' ' + col_name
The line above finds all rows where df[col_name] == 1, selects the 'tags' column for those rows, and sets it to the right-hand side value, df['tags'] + ' ' + col_name.
Note: .ix, which older pandas code used for this kind of assignment, has been deprecated since pandas v0.20 and was removed in v1.0. Use .loc or .iloc instead, as appropriate.
Upvotes: 10
Reputation: 294228
I'd do:
import numpy as np
import pandas as pd
cols = df.columns.to_series().values
pd.DataFrame(np.repeat(cols[None, :], len(df), 0)[df.astype(bool).values], df.index[df.any(axis=1)])
MaxU's method has the edge for large DataFrames.
[Timing comparisons, images not preserved: small df (5 x 3) and large df (1000000 x 52)]
Upvotes: 3
Reputation: 25639
Try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({"monkey":[0,1,0,1,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0], "cat":[0,0,0,0,1]})
df
   cat  fox  monkey  rabbit
0    0    0       0       1
1    0    0       1       0
2    0    1       0       0
3    0    0       1       0
4    1    0       0       0
pd.DataFrame([x for x in np.where(df == 1, df.columns, '').flatten().tolist() if len(x) > 0], columns=["animal"])
animal
0 rabbit
1 monkey
2 fox
3 monkey
4 cat
Upvotes: 0
Reputation: 259
I would use apply to decode the columns:
In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})
In [3]: def get_animal(row):
   ...:     for c in animals.columns:
   ...:         if row[c]==1:
   ...:             return c
In [4]: animals.apply(get_animal, axis=1)
Out[4]:
0 rabbit
1 monkey
2 fox
3 None
4 None
dtype: object
Upvotes: 16