Reputation: 89

How could I do one hot encoding with multiple values in one cell?

I have this table in Excel:

id  class
0   2 3
1   1 3 
2   3 5

Now, I want to do a 'special' one-hot encoding in Python. For each id in the first table, there are two numbers. Each number corresponds to a class (class1, class2, etc.). The second table is created based off of the first such that for each id, each number in its row shows up in its corresponding class column and the other columns just get zeros. For example, the numbers for id 0 are 2 and 3. The 2 is placed at class2 and the 3 is placed at class3. Classes 1, 4, and 5 get the default of 0. The result should be like:

id  class1  class2  class3  class4  class5
 0   0       2        3       0       0
 1   1       0        3       0       0
 2   0       0        3       0       5

My previous solution,

foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])

results in:

    class_1 class_2 class_3 class_3 class_5
  0  0.0     1.0    0.0      1.0    0.0
  1  1.0     0.0    0.0      1.0    0.0
  2  0.0     0.0    1.0      0.0    1.0

(class_3 appears twice). What can I do to fix this? (After this step, I can transform it to the final format I want.)

Upvotes: 4

Answers (4)

OmaymaS

Reputation: 1731

What about this?

Given this data

import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})

1- split values

df['class'] = df['class'].apply(lambda x: x.split(' '))

df
   id   class
0   0  [2, 3]
1   1  [1, 3]
2   2  [3, 5]

2- explode --> each record in a row

df_long = df.explode('class')

df_long
   id class
0   0     2
0   0     3
1   1     1
1   1     3
2   2     3
2   2     5

3- get one hot encoded values

df_one_hot_encoded = pd.concat([df, pd.get_dummies(df_long['class'],prefix='class', prefix_sep='_')], axis=1)

df_one_hot_encoded
   id   class  class_1  class_2  class_3  class_5
0   0  [2, 3]        0        1        0        0
0   0  [2, 3]        0        0        1        0
1   1  [1, 3]        1        0        0        0
1   1  [1, 3]        0        0        1        0
2   2  [3, 5]        0        0        1        0
2   2  [3, 5]        0        0        0        1

4- groupby id and get the max value per column (same result of logical OR for binary values) --> one row per id

df_one_hot_encoded.groupby('id').max().reset_index()

   id   class  class_1  class_2  class_3  class_5
0   0  [2, 3]        0        1        1        0
1   1  [1, 3]        1        0        1        0
2   2  [3, 5]        0        0        1        1

Bringing all together

import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'class': ['2 3', '1 3', '3 5']})
df['class'] = df['class'].apply(lambda x: x.split(' '))
df_long = df.explode('class')
df_one_hot_encoded = pd.concat([df, pd.get_dummies(df_long['class'],prefix='class', prefix_sep='_')], axis=1)
df_one_hot_encoded_compact = df_one_hot_encoded.groupby('id').max().reset_index()

Upvotes: 3

rrauenza

Reputation: 6993

Does this satisfy your problem as stated?

#!/usr/bin/python

input = [
    (0, (2,3)),
    (1, (1,3)),
    (2, (3,5)),
]

maximum = max(reduce(lambda x, y: x+list(y[1]), input, []))
# Or ...
# maximum = 0
# for i, classes in input:
#    maximum = max(maximum, *classes)

# print header.
print "\t".join(["id"] + ["class_%d" % i for i in range(1, 6)])

for i, classes in input:
    print i,
    for r in range(1, maximum+1):
        print "\t",
        if r in classes:
            print float(r),
        else:
            print 0.0,
    print

Output:

id      class_1 class_2 class_3 class_4 class_5
0       0.0     2.0     3.0     0.0     0.0
1       1.0     0.0     3.0     0.0     0.0
2       0.0     0.0     3.0     0.0     5.0

Upvotes: 4

Nickil Maveli

Reputation: 29719

You need to make your variables to be categorical and then you can use one hot encoding as shown:

In [18]: df1 = pd.DataFrame({"class":pd.Series(['2','1','3']).astype('category',categories=['1','2','3','4','5'])})

In [19]: df2 = pd.DataFrame({"class":pd.Series(['3','3','5']).astype('category',categories=['1','2','3','4','5'])})

In [20]: df_1 = pd.get_dummies(df1)

In [21]: df_2 = pd.get_dummies(df2)

In [22]: df_1.add(df_2).apply(lambda x: x * [i for i in range(1,len(df_1.columns)+1)], axis = 1).astype(int).rename_axis('id')
Out[22]: 
    class_1  class_2  class_3  class_4  class_5
id                                             
0         0        2        3        0        0
1         1        0        3        0        0
2         0        0        3        0        5

Upvotes: 4

Will McGinnis

Reputation: 189

It may be simpler to split the original dataframe into 3 columns:

id  class_a class_b
0   2          3
1   1          3  
2   3          5

And then perform a normal one-hot encoding on that. Afterwards you may end up with duplicates of columns like:

id  ... class_a_3 class_b_3 ... class_b_5
0          0          1             0
1          0          1             0
2          1          0             0

But you can merge/sum those after the fact pretty simply.

Likewise, you could pivot the same logic and transform your df into the form:

Then one-hot that, and aggregate using sum on the key id.

Upvotes: 3

How could I do one hot encoding with multiple values in one cell?

Answers (4)

Related Questions