beapen

Reputation: 30

One Hot Encoding a composite field

I want to one-hot encode multiple columns that share the same set of categorical values. I created a composite field and tried to use a OneHotEncoder on it as below (item1-item3 come from the same list of items):

import pyspark.sql.functions as F
from pyspark.ml.feature import OneHotEncoder, StringIndexer

def myConcat(*cols):
    # Replace nulls with a placeholder so concat does not return null
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df = df.withColumn("basket", myConcat("item1", "item2", "item3"))

indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)

# inputCol/outputCol are the constructor parameters
# (setInputCol/setOutputCol are setter methods, not keyword arguments)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
encoded = encoder.transform(indexed)

I am getting an out of memory error.

Does this approach work? How do I one-hot encode a composite field, or multiple columns whose categorical values come from the same list?

Upvotes: 1

Views: 786

Answers (2)

hamza tuna

Reputation: 1497

If you have an array of categorical values, why not try CountVectorizer:

import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# CountVectorizer expects an array column, so build the basket
# as an array rather than a concatenated string
df = df.withColumn("basket", F.array("item1", "item2", "item3"))

vectorizer = CountVectorizer(inputCol="basket", outputCol="basketVec")
vectorized = vectorizer.fit(df).transform(df)
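Conceptually, CountVectorizer learns a vocabulary of distinct items across all baskets and turns each basket into a vector of per-item counts, so the vector length grows with the number of distinct items rather than the number of distinct item combinations. A plain-Python sketch of that idea (the vocabulary and basket below are made-up values, not from the question's data):

```python
# Assumed vocabulary, as CountVectorizer would learn it from the data
vocab = ["milk", "bread", "eggs"]

# One row's basket (made-up example)
basket = ["milk", "milk", "eggs"]

# Count how often each vocabulary term appears in the basket
counts = [basket.count(term) for term in vocab]
print(counts)  # [2, 0, 1]
```

In Spark the result is stored as a sparse vector, which is what keeps this approach memory-friendly compared with one-hot encoding every distinct combination.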

Upvotes: 2

Edward Aung

Reputation: 3512

Note: I can't comment yet (I'm a new user).

What is the cardinality of your "item1", "item2" and "item3" columns?

More specifically, what does the following print?

# nunique() is the pandas API; for a Spark DataFrame use
# df.select("item1").distinct().count() instead
k1 = df.item1.nunique()
k2 = df.item2.nunique()
k3 = df.item3.nunique()
k = k1 * k2 * k3
print(k1, k2, k3, k)

One-hot encoding the composite field basically creates a very sparse matrix with the same number of rows as your original dataframe and up to k additional columns, where k is the product of the three numbers above (the number of distinct baskets that can occur).

Therefore, if those three numbers are large, you get an out-of-memory error.
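To see how quickly this blows up, here is the arithmetic with some assumed cardinalities (the numbers are illustrative, not from the question):

```python
# Hypothetical cardinalities for item1, item2, item3 (assumed values)
k1, k2, k3 = 100, 80, 50

# Upper bound on distinct "basket" strings, i.e. on one-hot columns
k = k1 * k2 * k3
print(k)  # 400000
```

Even modest per-column cardinalities can produce hundreds of thousands of one-hot columns for the composite field.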

The only solutions are to:

(1) increase your memory, or (2) introduce a hierarchy among the categories and use the higher-level categories to limit k.
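Option (2) can be sketched as a simple item-to-group mapping applied before indexing and encoding; the mapping below is a made-up example of such a hierarchy (in Spark you would apply it with a UDF or a join against a lookup table):

```python
# Hypothetical mapping from items to higher-level categories
# (assumed values, not from the original post)
item_to_group = {
    "apple": "fruit",
    "banana": "fruit",
    "carrot": "vegetable",
}

def to_group(item):
    # Fall back to a catch-all bucket for unmapped items
    return item_to_group.get(item, "other")

basket = ["apple", "carrot", "kiwi"]
groups = [to_group(i) for i in basket]
print(groups)  # ['fruit', 'vegetable', 'other']
```

Encoding the group labels instead of the raw items shrinks each column's cardinality, and therefore k, by orders of magnitude.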

Upvotes: 0
