Abhishek Sharma

How do I change the data type of a string column to double in Spark as a stage in a pipeline?

I am creating a pipeline with the following stages:

Array(Some_Indexer, Some_Encoder, Some_Assembler)

The assembler is a VectorAssembler that does not support StringType. How can I create one more stage in the pipeline to convert the datatype of string columns to double values?

Answers (1)

JamCon

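StringIndexer (org.apache.spark.ml.feature.StringIndexer) is what you're looking for. It maps each distinct string in a column to a numeric index stored as a double, so the result can be fed to an encoder or a VectorAssembler. It is described in the StringIndexer section of the Spark ML feature documentation.

Here's an example using the Titanic dataset. The Sex and Embarked fields are categorical and need to be converted to numeric values.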
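
Before the full pipeline, a minimal sketch of what StringIndexer does on its own may help. The tiny DataFrame here is made up for illustration (it is not the Titanic file), and the local SparkSession setup is an assumption for running outside spark-shell:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the existing session when run inside spark-shell
val spark = SparkSession.builder().master("local[*]").appName("StringIndexerSketch").getOrCreate()
import spark.implicits._

// Hypothetical sample: one string column with two categories
val df = Seq("male", "female", "male", "male").toDF("Sex")

// fit() learns the label-to-index mapping; transform() adds the index column
val indexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val indexed = indexer.fit(df).transform(df)

// SexIndex is a double column; by default the most frequent label gets 0.0
indexed.printSchema()
indexed.show()
```

Because the index is assigned by frequency, the numbers carry no ordering meaning, which is why the example below passes them through a OneHotEncoder before assembling.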



Sample Code:

// Assumes `spark` is an existing SparkSession, as in spark-shell
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val training = spark.read.option("header","true").option("inferSchema","true").format("csv").load("train.csv")

// Convert the categorical (string) values into numeric values
val genderIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex")
val embarkIndexer = new StringIndexer().setInputCol("Embarked").setOutputCol("EmbarkIndex")

// Convert the numerical index columns into One Hot columns
// The One Hot columns are binary {0,1} values of the categories
val genderEncoder = new OneHotEncoder().setInputCol("SexIndex").setOutputCol("SexVec")
val embarkEncoder = new OneHotEncoder().setInputCol("EmbarkIndex").setOutputCol("EmbarkVec")

// Create the vector structured data (label,features(vector))
val assembler = new VectorAssembler().setInputCols(Array("Pclass","SexVec","Age","SibSp","Parch","Fare","EmbarkVec")).setOutputCol("features")

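// Create the Logistic Regression instance
// LogisticRegression reads the label from a column named "label" by default;
// the training data shown below calls it "Survived", so point labelCol at it
val lr = new LogisticRegression().setLabelCol("Survived").setMaxIter(100).setRegParam(0.3).setElasticNetParam(0.8)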

// Create the model pipeline
val pipeline = new Pipeline().setStages(Array(genderIndexer,embarkIndexer,genderEncoder,embarkEncoder,assembler,lr))

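// Fit the whole pipeline (indexers, encoders, assembler, model) on the training data
val lrModel = pipeline.fit(training)

// Score the data
// `test` is not defined above; it is assumed to be a held-out DataFrame
// loaded the same way as `training`
val results = lrModel.transform(test)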
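
StringIndexer fits categorical strings. If the string column instead holds numbers stored as text (for example "1.5"), a cast may be closer to what the question asks for. One way to do that as a pipeline stage is SQLTransformer. The following is only a sketch: the column names `amount`, `amount_d` and `weight` and the sample rows are hypothetical, not from the Titanic data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the existing session when run inside spark-shell
val spark = SparkSession.builder().master("local[*]").appName("CastStageSketch").getOrCreate()
import spark.implicits._

// Hypothetical data: "amount" holds numeric values stored as strings
val raw = Seq(("1.5", 2.0), ("3.25", 4.0)).toDF("amount", "weight")

// __THIS__ is SQLTransformer's placeholder for the incoming DataFrame
val caster = new SQLTransformer()
  .setStatement("SELECT *, CAST(amount AS DOUBLE) AS amount_d FROM __THIS__")

// The assembler now sees only numeric columns
val assembler = new VectorAssembler()
  .setInputCols(Array("amount_d", "weight"))
  .setOutputCol("features")

val castPipeline = new Pipeline().setStages(Array(caster, assembler))
val out = castPipeline.fit(raw).transform(raw)
out.printSchema()
```

Depending on Spark's ANSI SQL setting, strings that cannot be parsed as numbers either become null or raise an error, and VectorAssembler rejects nulls by default, so it is worth checking the column contents first.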



Sample Data:

training.show(5,false)

+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|Name                                               |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|
+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|1          |0       |3     |Braund, Mr. Owen Harris                            |male  |22.0|1    |0    |A/5 21171       |7.25   |null |S       |
|2          |1       |1     |Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38.0|1    |0    |PC 17599        |71.2833|C85  |C       |
|3          |1       |3     |Heikkinen, Miss. Laina                             |female|26.0|0    |0    |STON/O2. 3101282|7.925  |null |S       |
|4          |1       |1     |Futrelle, Mrs. Jacques Heath (Lily May Peel)       |female|35.0|1    |0    |113803          |53.1   |C123 |S       |
|5          |0       |3     |Allen, Mr. William Henry                           |male  |35.0|0    |0    |373450          |8.05   |null |S       |
+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 5 rows



Sample Results:

results.show(5,false)
+-----+------+-------------------------------+----+----+-----+-----+--------+--------+--------+-----------+-------------+-------------+---------------------------------------+----------------------------------------+----------------------------------------+----------+
|label|Pclass|Name                           |Sex |Age |SibSp|Parch|Fare    |Embarked|SexIndex|EmbarkIndex|SexVec       |EmbarkVec    |features                               |rawPrediction                           |probability                             |prediction|
+-----+------+-------------------------------+----+----+-----+-----+--------+--------+--------+-----------+-------------+-------------+---------------------------------------+----------------------------------------+----------------------------------------+----------+
|0    |1     |Baxter, Mr. Quigg Edmond       |male|24.0|0    |1    |247.5208|C       |0.0     |1.0        |(1,[0],[1.0])|(2,[1],[1.0])|[1.0,1.0,24.0,0.0,1.0,247.5208,0.0,1.0]|[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Blackwell, Mr. Stephen Weart   |male|45.0|0    |0    |35.5    |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,45.0,0.0,0.0,35.5,1.0,0.0]    |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Carlsson, Mr. Frans Olof       |male|33.0|0    |0    |5.0     |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,33.0,0.0,0.0,5.0,1.0,0.0]     |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Carrau, Mr. Francisco M        |male|28.0|0    |0    |47.1    |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,28.0,0.0,0.0,47.1,1.0,0.0]    |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Foreman, Mr. Benjamin Laventall|male|30.0|0    |0    |27.75   |C       |0.0     |1.0        |(1,[0],[1.0])|(2,[1],[1.0])|[1.0,1.0,30.0,0.0,0.0,27.75,0.0,1.0]   |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
+-----+------+-------------------------------+----+----+-----+-----+--------+--------+--------+-----------+-------------+-------------+---------------------------------------+----------------------------------------+----------------------------------------+----------+
only showing top 5 rows
