Reputation: 445
I am trying to build a neural network using pyspark.ml. The problem is that I am using OneHotEncoder and other pre-processing methods to transform the categorical variables. The stages in my pipeline are:

1. StringIndexer on each categorical column
2. OneHotEncoder on each indexed column
3. VectorAssembler to combine the encoded categorical columns with the continuous ones
4. PCA on the assembled feature vector
5. MultilayerPerceptronClassifier
But the problem is that I don't know the number of features after Step 4 to pass to the layers parameter of the classifier in Step 5. How can I obtain the final number of features? Here is my code; I did not include the imports and the data loading part.
stages = []
for c in Categories:
    stringIndexer = StringIndexer(inputCol=c, outputCol=c + "_indexed")
    encoder = OneHotEncoder(inputCol=c + "_indexed", outputCol=c + "_categoryVec")
    stages += [stringIndexer, encoder]

labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel")

final_features = list(map(lambda c: c + "_categoryVec", Categories)) + Continuous
assembler = VectorAssembler(
    inputCols=final_features,
    outputCol="features")

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")

(train_val, test_val) = train.randomSplit([0.95, 0.05])
num_classes = train.select("Target").distinct().count()

NN = MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol="pcaFeatures",
                                    maxIter=100,
                                    layers=[????, 5, 5, num_classes],
                                    blockSize=10, seed=1234)

stages += [labelIndexer, assembler, pca, NN]
pipeline = Pipeline(stages=stages)
model = pipeline.fit(train_val)
Upvotes: 1
Views: 349
Reputation: 43534
From the docs, the input parameter k is the number of principal components.
So in your case:
pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")
The number of features is 20.
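That value can be plugged straight into the first element of layers. A minimal sketch, assuming the k=20 PCA output feeds the classifier exactly as in the question:

# the input layer matches the PCA output size (k = 20); the output layer matches the class count
NN = MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol="pcaFeatures",
                                    maxIter=100, layers=[20, 5, 5, num_classes],
                                    blockSize=10, seed=1234)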
Update
Another way to do it would be to look at the length of one of the assembled vectors.
For example, if you wanted the length after Step 3:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# assembled_df is the DataFrame after the VectorAssembler stage has been applied
nfeatures = assembled_df.withColumn('len', udf(len, IntegerType())(col('features'))) \
    .select('len').take(1)[0]['len']
I feel like there should be a better way to do this, i.e. without having to call take().
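One possibility is to read the vector length from the column metadata instead of collecting a row. This is only a sketch, assuming the assembled DataFrame (assembled_df above) carries the ML attribute metadata that VectorAssembler writes on its output column:

# no Spark job is triggered; the size is read from the ml_attr metadata
meta = assembled_df.schema['features'].metadata
nfeatures = meta['ml_attr']['num_attrs']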
Upvotes: 2