piotrm50

Reputation: 53

Dimension mismatch error in Spark ML

I'm pretty new to both ML and Spark ML. I'm trying to build a prediction model using neural networks with Spark ML, but I get the error below when I call the .transform method on my trained model. The problem seems to be caused by the use of OneHotEncoder, because without it everything works fine. I have tried taking OneHotEncoder out of the pipeline.

My question is: how can I use OneHotEncoder and not get this error?

java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
    at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
    at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
    at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)

My code:

test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)
train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)

joined = train_df.union(test_df)

assembler = VectorAssembler().setInputCols(features).setOutputCol("features")

label_indexer = StringIndexer().setInputCol(
    "label").setOutputCol("label_index")

label_indexer_fit = [label_indexer.fit(joined)]

string_indexers = [StringIndexer().setInputCol(
    name).setOutputCol(name + "_index").fit(joined) for name in categorical_feats]

one_hot_pipeline = Pipeline().setStages([OneHotEncoder().setInputCol(
    name + '_index').setOutputCol(name + '_one_hot') for name in categorical_feats])

mlp = MultilayerPerceptronClassifier().setLabelCol(label_indexer.getOutputCol()).setFeaturesCol(
    assembler.getOutputCol()).setLayers([len(features), 20, 10, 2]).setSeed(42).setBlockSize(1000).setMaxIter(500)
pipeline = Pipeline().setStages(label_indexer_fit
                                + string_indexers + [one_hot_pipeline] + [assembler, mlp])

model = pipeline.fit(train_df)

# compute accuracy on the test set
result = model.transform(test_df)

## FAILS ON RESULT

predictionAndLabels = result.select("prediction", "label_index")

evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print("-------------------------------")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print("-------------------------------")

Thanks!

Upvotes: 5

Views: 3292

Answers (2)

yeamusic21

Reputation: 385

I had the same issue and took a more manual approach to what user6910411 suggested. So for example I had

layers = [100, 100, 100, 100]

but my number of input variables was actually 199, so I just changed to

layers = [199, 100, 100, 100]

and the problem appeared to resolve. :-D

Upvotes: -1

zero323

Reputation: 330063

The layers Param in your model is not correct:

setLayers([len(features), 20, 10, 2])

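The first layer should reflect the number of input features, which in general won't be the same as the number of raw columns before encoding: OneHotEncoder expands each indexed column into a vector, so the assembled features vector can be wider than len(features).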
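To see why the two numbers differ, here is a minimal sketch of the arithmetic. It assumes Spark's OneHotEncoder with its default dropLast=True (a column with n categories becomes a vector of length n - 1) and that each numeric column contributes one slot. The helper names and the column counts are made up for illustration and are not taken from the question's data.

```python
def one_hot_width(num_categories, drop_last=True):
    """Width of the vector produced for one categorical column."""
    return num_categories - 1 if drop_last else num_categories


def assembled_width(num_numeric_cols, category_counts, drop_last=True):
    """Width of the assembled vector: numeric slots plus all one-hot slots."""
    return num_numeric_cols + sum(
        one_hot_width(n, drop_last) for n in category_counts)


# Hypothetical dataset: 6 numeric columns and 3 categorical columns
# with 9, 16 and 2 distinct values respectively.
category_counts = [9, 16, 2]
raw_columns = 6 + len(category_counts)            # what len(features) counts
input_size = assembled_width(6, category_counts)  # what the first layer needs

print(raw_columns, input_size)
```

With these made-up counts the raw column count and the assembled width are far apart, which is the kind of gap that triggers the dimension mismatch.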

If you don't know the total number of features up front, you can, for example, separate feature extraction from model training. Pseudocode:

feature_pipeline_model = (Pipeline()
     .setStages(...)  # Only feature extraction
     .fit(train_df))

train_df_features = feature_pipeline_model.transform(train_df)
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]
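One way the pseudocode above might continue is sketched below. build_layers and train_mlp are hypothetical helper names, the hidden sizes and other settings are simply copied from the question, and the sketch assumes the assembled column is called "features" and carries the ml_attr metadata shown above.

```python
def build_layers(num_features, hidden=(20, 10), num_classes=2):
    """Layer sizes: input width, hidden layers, one output per class."""
    return [num_features] + list(hidden) + [num_classes]


def train_mlp(train_df_features, label_col="label_index"):
    """Fit the classifier on a DataFrame that already has a features column."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # Read the real vector width from the column metadata, as in the
    # pseudocode, instead of counting raw columns.
    num_features = (train_df_features.schema["features"]
                    .metadata["ml_attr"]["num_attrs"])
    mlp = MultilayerPerceptronClassifier(
        labelCol=label_col,
        featuresCol="features",
        layers=build_layers(num_features),
        seed=42,
        blockSize=1000,
        maxIter=500)
    return mlp.fit(train_df_features)
```

The test set has to go through the same feature_pipeline_model before the fitted classifier's transform is called, so that both stages see vectors of the same width.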

Upvotes: 6
