Alaleh Rz

Reputation: 445

How to obtain the number of features after preprocessing to use pyspark.ml neural network classifier?

I am trying to build a neural network using pyspark.ml. The problem is that I am using OneHotEncoder and other preprocessing methods to transform the categorical variables. The stages in my pipeline are:

  1. indexing the categorical features
  2. applying OneHotEncoder
  3. applying VectorAssembler
  4. applying PCA
  5. feeding the "pcaFeatures" to a neural network classifier

But the problem is that I don't know the number of features after step 4, which I need for the "layers" argument of the classifier in step 5. My question is: how can I obtain that final number of features? Here is my code (I did not include the imports and the data-loading part):

stages = []
for c in Categories:
    stringIndexer = StringIndexer(inputCol=c, outputCol=c + "_indexed")
    encoder = OneHotEncoder(inputCol=c + "_indexed", outputCol=c + "_categoryVec")
    stages += [stringIndexer, encoder]

labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel")

final_features = list(map(lambda c: c + "_categoryVec", Categories)) + Continuous


assembler = VectorAssembler(
    inputCols= final_features,
    outputCol="features")

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")
(train_val, test_val) = train.randomSplit([0.95, 0.05])

num_classes= train.select("Target").distinct().count()

NN= MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol='pcaFeatures', maxIter=100,
                                    layers=[????, 5, 5, num_classes], blockSize=10, seed=1234)


stages += [labelIndexer]
stages += [assembler]
stages += [pca]
stages += [NN]

pipeline = Pipeline(stages=stages)
model = pipeline.fit(train_val)

Upvotes: 1

Views: 349

Answers (1)

pault

Reputation: 43534

From the docs, the input parameter k is the number of principal components.

So in your case:

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")

The number of features is 20.
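So, with k=20 as above, the first entry of layers is simply 20. A minimal sketch, reusing your own classifier call:

# With k=20 principal components, the MLP input layer is 20 wide.
NN = MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol="pcaFeatures",
                                    maxIter=100, layers=[20, 5, 5, num_classes],
                                    blockSize=10, seed=1234)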

Update

Another way to do it would be to look at the length of one of the assembled vectors.

For example, if you wanted the length after Step 3 (assuming a DataFrame encoded_df that already contains the encoder output columns):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# The assembler is a Transformer, so apply it to a DataFrame first,
# then measure the length of one assembled vector.
nfeatures = assembler.transform(encoded_df)\
    .withColumn('len', udf(len, IntegerType())(col('features')))\
    .select('len').take(1)[0]['len']

I feel like there should be a better way to do this, i.e. without having to call take().
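One possible alternative (a sketch, assuming the usual VectorAssembler behavior): the assembler attaches ML attribute metadata to its output column, so the vector size can often be read straight from the schema without collecting any rows:

# Read the vector size from the column metadata written by VectorAssembler;
# encoded_df is the same hypothetical DataFrame with the *_categoryVec columns.
assembled = assembler.transform(encoded_df)
nfeatures = assembled.schema['features'].metadata['ml_attr']['num_attrs']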

Upvotes: 2
