Reputation: 30
I want to transform multiple columns that share the same categorical values using a OneHotEncoder. I created a composite field and tried to use OneHotEncoder on it as shown below (item1, item2 and item3 come from the same list of items):
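import pyspark.sql.functions as F
from pyspark.ml.feature import OneHotEncoder, StringIndexer

def myConcat(*cols):
    # Concatenate the columns into one string, substituting "*" for nulls
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df = df.withColumn("basket", myConcat("item1", "item2", "item3"))
indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
# Spark 2.x style; on Spark 3.x OneHotEncoder is an estimator, so use encoder.fit(indexed).transform(indexed)
encoded = encoder.transform(indexed)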
I am getting an out-of-memory error.
Does this approach work? How do I one-hot encode a composite field, or multiple columns whose categorical values come from the same list?
Upvotes: 1
Views: 786
Reputation: 1497
If your categorical values form an array, why not try CountVectorizer?
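import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# CountVectorizer expects an array<string> column, so build an array rather than one concatenated string
items = ["item1", "item2", "item3"]
df = df.withColumn("basket", F.array(*[F.coalesce(F.col(c), F.lit("*")) for c in items]))
indexer = CountVectorizer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)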
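To make the idea concrete, here is a minimal pure-Python sketch of the kind of vector this produces: one slot per distinct item across all three columns, set to 1 when the item appears in the row. It is an illustration only, not Spark's implementation. The sample rows and the helper names `build_vocab` and `multi_hot` are made up, and the vocabulary is ordered by first appearance, whereas CountVectorizer orders terms by frequency and stores counts unless its `binary` parameter is set.

```python
PLACEHOLDER = "*"  # stands in for a missing item, like the "*" used in myConcat


def build_vocab(rows):
    """Map each distinct item to a slot index, in order of first appearance."""
    vocab = {}
    for row in rows:
        for item in row:
            key = PLACEHOLDER if item is None else item
            vocab.setdefault(key, len(vocab))
    return vocab


def multi_hot(row, vocab):
    """Return a 0/1 vector with a 1 in the slot of every item present in the row."""
    vec = [0] * len(vocab)
    for item in row:
        key = PLACEHOLDER if item is None else item
        vec[vocab[key]] = 1
    return vec


# Hypothetical sample data: three item columns drawn from one shared item list
rows = [
    ("apple", "bread", "milk"),
    ("bread", "apple", None),
    ("milk", "milk", "eggs"),
]

vocab = build_vocab(rows)
encoded = [multi_hot(row, vocab) for row in rows]

for row, vec in zip(rows, encoded):
    print(row, vec)
```

The vector width equals the number of distinct items, no matter how many different combinations of items occur in the data.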
Upvotes: 2
Reputation: 3512
Note: I can't comment yet (due to the fact that I'm a new user).
What is the cardinality of "item1", "item2" and "item3"?
More specifically, what does the following print?
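import pyspark.sql.functions as F

# nunique() is a pandas method; on a Spark DataFrame use countDistinct
k1, k2, k3 = df.agg(
    F.countDistinct("item1"), F.countDistinct("item2"), F.countDistinct("item3")
).first()
k = k1 * k2 * k3
print(k1, k2, k3)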
One-hot encoding the concatenated field gives every row of your original dataframe a very sparse vector with one slot per distinct basket value, and the number of distinct baskets can reach k, the product of the three numbers printed above.
Therefore, if the three numbers are large, you can run out of memory.
The only solutions are to:
(1) increase your memory, or (2) introduce a hierarchy among the categories and use the higher-level categories to limit k.
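As a rough illustration of the arithmetic, the pure-Python sketch below counts the widths involved for a small made-up catalog. The item names, the `category` mapping, and the variable names are all hypothetical, and the rows are simply every ordered choice of three different items, so the numbers only show how the quantities relate, not what your data will produce.

```python
from itertools import permutations
from math import prod

# Hypothetical shared item list; each row picks three different items in order
catalog = ["apple", "bread", "milk", "eggs"]
rows = list(permutations(catalog, 3))

# Distinct values per column: the k1, k2, k3 from the snippet above
per_column = [len({row[i] for row in rows}) for i in range(3)]
upper_bound = prod(per_column)  # k = k1 * k2 * k3

# Width when one-hot encoding the concatenated field: one slot per distinct basket
composite_width = len({"|".join(row) for row in rows})

# Width when encoding against the shared item list: one slot per distinct item
shared_width = len({item for row in rows for item in row})

# Option (2): replace items by higher-level categories (made-up hierarchy)
category = {"apple": "produce", "bread": "bakery", "milk": "dairy", "eggs": "dairy"}
coarse_rows = {tuple(category[item] for item in row) for row in rows}
coarse_width = len(coarse_rows)

print("per column:", per_column, "upper bound k:", upper_bound)
print("composite:", composite_width, "shared:", shared_width, "coarse:", coarse_width)
```

With these four items the composite field already needs 24 slots out of a possible 64, while the shared item list needs 4. Coarsening to three categories halves the composite width to 12.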
Upvotes: 0