Corey
Corey

Reputation: 1855

Why does Spark's OneHotEncoder drop the last category by default?

I would like to understand the rational behind the Spark's OneHotEncoder dropping the last category by default.

For example:

>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder will drop the last category:

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

Question::

The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent.

Upvotes: 19

Views: 6093

Answers (1)

Romain Jouin
Romain Jouin

Reputation: 4848

According to the doc it is to keep the column independents:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via OneHotEncoder!.dropLast because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html

Upvotes: 11

Related Questions