Reputation: 89
I have this table in Excel:
id class
0 2 3
1 1 3
2 3 5
Now, I want to do a 'special' one-hot encoding in Python. For each id in the first table there are two numbers, and each number corresponds to a class (class1, class2, etc.). The second table is built from the first so that, for each id, each number in its row appears in its corresponding class column and the other columns get zeros. For example, the numbers for id 0 are 2 and 3: the 2 is placed in class2 and the 3 in class3, while classes 1, 4, and 5 get the default of 0. The result should be like:
id class1 class2 class3 class4 class5
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
My previous solution,
foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])
results in:
class_1 class_2 class_3 class_3 class_5
0 0.0 1.0 0.0 1.0 0.0
1 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 1.0 0.0 1.0
(class_3 appears twice). What can I do to fix this? (After this step, I can transform it to the final format I want.)
Upvotes: 4
Views: 10397
Reputation: 1731
What about this?
Given this data
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})
1- split values
df['class'] = df['class'].apply(lambda x: x.split(' '))
df
id class
0 0 [2, 3]
1 1 [1, 3]
2 2 [3, 5]
2- explode --> each record in a row
df_long = df.explode('class')
df_long
id class
0 0 2
0 0 3
1 1 1
1 1 3
2 2 3
2 2 5
3- get one hot encoded values
df_one_hot_encoded = pd.concat([df_long, pd.get_dummies(df_long['class'], prefix='class', prefix_sep='_', dtype=int)], axis=1)
df_one_hot_encoded
  id class  class_1  class_2  class_3  class_5
0  0     2        0        1        0        0
0  0     3        0        0        1        0
1  1     1        1        0        0        0
1  1     3        0        0        1        0
2  2     3        0        0        1        0
2  2     5        0        0        0        1
(Using df_long on both sides of the concat keeps the indexes aligned; pandas raises an error when concatenating along axis=1 if only one side has a duplicated index.)
4- groupby id
and get the max value per column (same result as a logical OR for binary values) --> one row per id
df_one_hot_encoded.groupby('id').max().reset_index()
   id class  class_1  class_2  class_3  class_5
0   0     3        0        1        1        0
1   1     3        1        0        1        0
2   2     5        0        0        1        1
Bringing it all together
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})
df['class'] = df['class'].apply(lambda x: x.split(' '))
df_long = df.explode('class')
df_one_hot_encoded = pd.concat([df_long, pd.get_dummies(df_long['class'], prefix='class', prefix_sep='_', dtype=int)], axis=1)
df_one_hot_encoded_compact = df_one_hot_encoded.groupby('id').max().reset_index()
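To get the OP's final format directly (the class number instead of a 1), a possible follow-up sketched under the same toy data: multiply each dummy row by its class value before taking the per-id max. The `reindex` restores the empty class4 column; variable names here are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})

# one row per (id, class) pair, with class as an integer
df_long = (df.assign(**{'class': df['class'].str.split()})
             .explode('class')
             .reset_index(drop=True))
df_long['class'] = df_long['class'].astype(int)

# put the class number (not a 1) where each dummy fires
dummies = pd.get_dummies(df_long['class'], prefix='class', dtype=int)
weighted = dummies.mul(df_long['class'], axis=0)

result = (weighted.groupby(df_long['id']).max()
          .reindex(columns=[f'class_{i}' for i in range(1, 6)], fill_value=0)
          .reset_index())
```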
Upvotes: 3
Reputation: 6993
Does this satisfy your problem as stated?
#!/usr/bin/env python3
from functools import reduce

data = [
    (0, (2, 3)),
    (1, (1, 3)),
    (2, (3, 5)),
]

# Highest class number appearing anywhere in the data.
maximum = max(reduce(lambda x, y: x + list(y[1]), data, []))
# Or ...
# maximum = 0
# for i, classes in data:
#     maximum = max(maximum, *classes)

# Print header.
print("\t".join(["id"] + ["class_%d" % i for i in range(1, maximum + 1)]))
for i, classes in data:
    cells = [str(i)]
    for r in range(1, maximum + 1):
        cells.append(str(float(r)) if r in classes else str(0.0))
    print("\t".join(cells))
Output:
id class_1 class_2 class_3 class_4 class_5
0 0.0 2.0 3.0 0.0 0.0
1 1.0 0.0 3.0 0.0 0.0
2 0.0 0.0 3.0 0.0 5.0
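For comparison, the same table can be built in pandas; this is a sketch of one way to do it (the `pivot_table` route and the variable names are my assumption, not part of this answer):

```python
import pandas as pd

data = [(0, (2, 3)), (1, (1, 3)), (2, (3, 5))]

# long form: one (id, class) pair per row
long = pd.DataFrame([(i, c) for i, cs in data for c in cs],
                    columns=['id', 'class'])
long['val'] = long['class'].astype(float)

# one column per class, the class number as the cell value, 0.0 elsewhere
out = (long.pivot_table(index='id', columns='class', values='val',
                        aggfunc='max', fill_value=0.0)
       .reindex(columns=range(1, 6), fill_value=0.0)
       .add_prefix('class_'))
```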
Upvotes: 4
Reputation: 29719
You need to make your variables categorical, and then you can use one-hot encoding, as shown (`.astype('category', categories=...)` was removed from pandas; `pd.CategoricalDtype` is the current way to declare the full category set):
In [17]: cat = pd.CategoricalDtype(categories=['1', '2', '3', '4', '5'])
In [18]: df1 = pd.DataFrame({"class": pd.Series(['2', '1', '3']).astype(cat)})
In [19]: df2 = pd.DataFrame({"class": pd.Series(['3', '3', '5']).astype(cat)})
In [20]: df_1 = pd.get_dummies(df1, dtype=int)
In [21]: df_2 = pd.get_dummies(df2, dtype=int)
In [22]: df_1.add(df_2).mul(list(range(1, len(df_1.columns) + 1))).rename_axis('id')
Out[22]:
class_1 class_2 class_3 class_4 class_5
id
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
Upvotes: 4
Reputation: 189
It may be simpler to split the original dataframe into 3 columns:
id class_a class_b
0 2 3
1 1 3
2 3 5
And then perform a normal one-hot encoding on that. Afterwards you may end up with duplicate columns, like:
id ... class_a_3 class_b_3 ... class_b_5
0 0 1 0
1 0 1 0
2 1 0 0
But you can merge/sum those after the fact pretty simply.
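A sketch of this first route (the split-column names are assumed): encode both columns with the same prefix, then sum the duplicate labels together.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})

# split the two numbers into separate columns
split = df['class'].str.split(' ', expand=True)
split.columns = ['class_a', 'class_b']

# both columns share the 'class' prefix, so e.g. class_3 can appear twice
dummies = pd.get_dummies(split, prefix=['class', 'class'], dtype=int)

# merge duplicate column labels by summing them
merged = dummies.T.groupby(level=0).sum().T
```

The row index of `merged` lines up with `df`, so the id column can be joined back on afterwards.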
Likewise, you could pivot the same logic and transform your df into the form:
id class
0 2
0 3
1 1
1 3
2 3
2 5
Then one-hot that, and aggregate using sum on the key id.
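A sketch of this pivoted route (names assumed again): explode to one (id, class) pair per row, one-hot encode, then sum per id.

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})

# long form: one (id, class) pair per row
long = (df.assign(**{'class': df['class'].str.split()})
          .explode('class')
          .reset_index(drop=True))

# one-hot the class, then aggregate with sum on id
hot = pd.get_dummies(long['class'].astype(int), prefix='class', dtype=int)
out = hot.groupby(long['id']).sum()
```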
Upvotes: 3