Joel

Reputation: 1690

Spark Categorical Data Encoding

Is there a function in Spark to do categorical data encoding? For example:

Var1,Var2,Var3
1,2,a
2,3,b
3,2,c

To

var1,var2,var3
1,2,0
2,3,1
3,2,2

a -> 0, b -> 1, c -> 2

Upvotes: 1

Views: 545

Answers (2)

RTG

Reputation: 1

Python - PySpark 2.0.0 +

import pandas as pd

# Load with pandas, then convert to a Spark DataFrame
df = pd.read_csv("file.csv", keep_default_na=False)
df = spark.createDataFrame(df)

# Count the number of distinct values (the arity) of each string column
arity = {}
for i, (c, t) in enumerate(df.dtypes):
    if t == 'string':
        arity[i] = df.select(c).distinct().count()
        print(arity[i], i, c)

This arity dictionary can then be passed as the categoricalFeaturesInfo argument to MLlib's tree-based trainers.
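As a sketch of what the loop above computes, here is the same arity count in plain Python over the example rows from the question, so the result can be inspected without a Spark session (the row data is taken from the question; everything else here is illustrative):

```python
# Rows from the question; var3 is the only string column (index 2)
rows = [(1, 2, "a"), (2, 3, "b"), (3, 2, "c")]

# arity[i] = number of distinct values in column i, for string columns only
arity = {}
for i in range(len(rows[0])):
    if isinstance(rows[0][i], str):
        arity[i] = len({r[i] for r in rows})

print(arity)  # {2: 3}
```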

Upvotes: 0

kalyan padhiloju

Reputation: 41

Use this function for categorical data encoding:

Python

# Map each distinct value in column `idx` to a unique index
def get_mapping(rdd, idx):
    return rdd.map(lambda x: x[idx]).distinct().zipWithIndex().collectAsMap()

Scala

val categories = rdd.map(r => r(2)).distinct.collect.zipWithIndex.toMap
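A plain-Python analogue of get_mapping is sketched below so the resulting dictionary can be inspected without a cluster; the helper name and row data are illustrative only. Note that Spark's distinct().zipWithIndex() does not guarantee any particular ordering of values, whereas this local version uses first-seen order for determinism:

```python
# Pure-Python analogue of:
#   rdd.map(lambda x: x[idx]).distinct().zipWithIndex().collectAsMap()
def get_mapping_local(rows, idx):
    # Collect distinct values of column `idx` in first-seen order
    seen = []
    for r in rows:
        if r[idx] not in seen:
            seen.append(r[idx])
    # Zip each distinct value with its position
    return {value: i for i, value in enumerate(seen)}

rows = [(1, 2, "a"), (2, 3, "b"), (3, 2, "c")]
print(get_mapping_local(rows, 2))  # {'a': 0, 'b': 1, 'c': 2}
```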

Upvotes: 3
