jswtraveler

Reputation: 355

java.lang.NullPointerException after df.na.fill("Missing") in Scala?

I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.

The way I've done it is with the StringIndexer in Scala. Before running it, I used df.na.fill("Missing") to replace missing values. Even after running that, I still get a NullPointerException.

Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.

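import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// Replace nulls in all string columns with the literal "Missing"
val newDf1 = reweight.na.fill("Missing")

// Categorical (string) columns to index
val cat_cols = Array("highest_tier_nm", "day_of_week", "month", 
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")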

val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer() 
        .setInputCol(cname)
        .setOutputCol(s"${cname}_index"))

val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages) 
val cat_reweight = categorical.fit(newDf)

Upvotes: 1

Views: 248

Answers (1)

Shaido

Reputation: 28367

Normally, when using machine learning, you train the model on one part of the data and then test it on another part. Spark ML reflects this with two separate methods: fit() and transform(). You have only used fit(), which is equivalent to training a model (or a pipeline).

This means that your cat_reweight is not a dataframe; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the data used for training and returns a dataframe. In other words, you should add .transform(newDf1) after fit(newDf1).
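A minimal sketch of the fit-then-transform flow, written for the spark-shell or a Scala script. The local SparkSession and the three-row toy dataframe are assumptions for illustration (the real reweight dataframe is not shown in the question), and only two of the question's columns are used:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("indexer-sketch").getOrCreate()
import spark.implicits._

// Assumption: toy stand-in for the question's `reweight` dataframe, with some nulls.
val rows: Seq[(String, String)] = Seq(("gold", "Mon"), (null, "Tue"), ("silver", null))
val reweight = rows.toDF("highest_tier_nm", "day_of_week")

// Replace nulls in all string columns before indexing.
val newDf1 = reweight.na.fill("Missing")

val catCols = Array("highest_tier_nm", "day_of_week")
val stages: Array[PipelineStage] = catCols.map { cname =>
  new StringIndexer().setInputCol(cname).setOutputCol(s"${cname}_index")
}

// fit() returns a PipelineModel, not a dataframe.
val model = new Pipeline().setStages(stages).fit(newDf1)

// transform() applies the fitted indexers and returns a dataframe
// with the extra *_index columns.
val indexed = model.transform(newDf1)
indexed.show()
```

Note that the same newDf1 is passed to both fit() and transform().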


Another possible issue is that your code calls fit(newDf) instead of fit(newDf1). Make sure the dataframe with the nulls replaced is the one passed to both fit() and transform(); otherwise the remaining nulls will still cause a NullPointerException.

It works for me when running locally. However, if you still get an error, you could try calling cache() after replacing the nulls and then performing an action, to make sure all transformations have been executed.
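A rough sketch of that suggestion, again assuming a local SparkSession and a toy stand-in for reweight. The per-column null count at the end is an extra check that is not part of the original advice; it can be useful because na.fill with a string value only replaces nulls in string columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
import spark.implicits._

// Assumption: toy stand-in for the question's `reweight` dataframe.
val rows: Seq[(String, String)] = Seq(("gold", "Mon"), (null, "Tue"), ("silver", null))
val reweight = rows.toDF("highest_tier_nm", "day_of_week")

// Mark the filled dataframe for caching, then run an action so the
// fill is actually computed and stored.
val newDf1 = reweight.na.fill("Missing").cache()
newDf1.count()

// Extra check (not from the answer): count the remaining nulls in each
// column that will be passed to StringIndexer.
val catCols = Array("highest_tier_nm", "day_of_week")
val nullCounts = newDf1.select(
  catCols.map(c => count(when(col(c).isNull, c)).alias(c)): _*
)
nullCounts.show()
```

If any of these counts is non-zero, that column still contains nulls, and it is worth checking its type with printSchema.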

Hope it helps!

Upvotes: 1
