Reputation: 1690
Is there a function in Spark to do categorical data encoding? For example, to go from:

Var1,Var2,Var3
1,2,a
2,3,b
3,2,c

to:

Var1,Var2,Var3
1,2,0
2,3,1
3,2,2

with the mapping a -> 0, b -> 1, c -> 2.
Upvotes: 1
Views: 545
Reputation: 1
import pandas as pd

# Read with pandas, then convert to a Spark DataFrame
pdf = pd.read_csv("file.csv", keep_default_na=False)
df = spark.createDataFrame(pdf)

# For each string column, record its arity (number of distinct values),
# keyed by the column's position
arity = {}
for i, (c, t) in enumerate(df.dtypes):
    if t == 'string':
        arity[i] = df.select(c).distinct().count()
        print(arity[i], i, c)
This arity dictionary can then act as your input to the categoricalFeaturesInfo parameter of MLlib's tree-based models.
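As a sketch of what the loop above computes, here is the same arity calculation emulated with plain Python collections, no SparkSession required. The dtypes list and rows mirror the example CSV from the question; they are stand-ins for df.dtypes and the DataFrame's contents, not real Spark objects:

```python
# Stand-ins for df.dtypes and the DataFrame's rows from the example CSV
dtypes = [("Var1", "bigint"), ("Var2", "bigint"), ("Var3", "string")]
rows = [("1", "2", "a"), ("2", "3", "b"), ("3", "2", "c")]

# Same logic as the Spark loop: arity of every string column, keyed by position
arity = {}
for i, (c, t) in enumerate(dtypes):
    if t == "string":
        # len of the set of values emulates distinct().count()
        arity[i] = len({row[i] for row in rows})

print(arity)  # {2: 3} -- column Var3 (index 2) has 3 distinct values
```

The keys are feature positions and the values are category counts, which is exactly the shape categoricalFeaturesInfo expects.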
Upvotes: 0
Reputation: 41
Use this function for categorical data encoding:

def get_mapping(rdd, idx):
    # Map each distinct value in column idx to a unique integer index
    return rdd.map(lambda x: x[idx]).distinct().zipWithIndex().collectAsMap()

The Scala equivalent:

val categories = rdd.map(r => r(2)).distinct.collect.zipWithIndex.toMap
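To actually encode the column, you would look each value up in the mapping that get_mapping returns and substitute its index. A minimal plain-Python sketch of that substitution step, using the question's example data; note the concrete a/b/c indices below are only illustrative, since zipWithIndex's ordering depends on how the RDD is partitioned:

```python
# Illustrative mapping, as get_mapping(rdd, 2) might return for the example
# data (the actual zipWithIndex ordering is not guaranteed)
mapping = {"a": 0, "b": 1, "c": 2}

rows = [("1", "2", "a"), ("2", "3", "b"), ("3", "2", "c")]

# Replace the categorical column (index 2) with its integer code;
# in Spark this lookup would typically run inside an rdd.map
encoded = [(r[0], r[1], mapping[r[2]]) for r in rows]
print(encoded)  # [('1', '2', 0), ('2', '3', 1), ('3', '2', 2)]
```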
Upvotes: 3