Wrong vector size of OneHotEncoder in pyspark

Question

I have been checking the output of OneHotEncoder in pyspark. According to forums and the encoder's documentation, the size of the encoded vector should equal the number of distinct values in the column being encoded.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

model = stringIndexer.fit(df)

indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")

encoded = encoder.transform(indexed)
encoded.show()

The following is the result of the above code:

+---+--------+--------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+--------------+-------------+
|  0|       a|           0.0|(2,[0],[1.0])|
|  1|       b|           2.0|    (2,[],[])|
|  2|       c|           1.0|(2,[1],[1.0])|
|  3|       a|           0.0|(2,[0],[1.0])|
|  4|       a|           0.0|(2,[0],[1.0])|
|  5|       c|           1.0|(2,[1],[1.0])|
+---+--------+--------------+-------------+

Reading the categoryVec column, the size of the vector is 2, whereas the number of distinct values in the "category" column is 3, i.e. a, b and c. Please help me understand what I am missing here.

pault · Accepted Answer

From the docs for pyspark.ml.feature.OneHotEncoder:

class pyspark.ml.feature.OneHotEncoder(dropLast=True, inputCol=None, outputCol=None)
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

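So for n categories, you will have an output vector of size n-1 unless you set dropLast to False. There is nothing wrong or strange about this: you only need n-1 indices to uniquely map all categories, because the last category is represented by the all-zeros vector. In your output, b (index 2.0) is the last category, which is why its vector is (2,[],[]).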
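To make the n-1 rule concrete, here is a minimal pure-Python sketch of the documented dropLast behaviour. The helper one_hot is hypothetical and not part of pyspark; it only imitates what the docs describe, using dense lists instead of Spark's sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    Hypothetical helper that imitates the documented dropLast
    behaviour of OneHotEncoder; it is not a pyspark function.
    """
    # With drop_last, the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # The last category has no slot when drop_last is set, so it stays all zeros.
    if index < size:
        vec[index] = 1.0
    return vec


# Category indices from the question: a -> 0, c -> 1, b -> 2
for label, idx in [("a", 0), ("c", 1), ("b", 2)]:
    print(label, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With the default, b maps to a length-2 vector of zeros, which matches the (2,[],[]) row in the question. If you want a vector of size 3 from Spark itself, the constructor signature quoted above suggests passing dropLast=False when creating the OneHotEncoder.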
