Reputation: 2316
I am working on Linear Regression modeling in PySpark and have a question about it. My data has categorical features. I went through the PySpark documentation, and the example for Linear Regression shows this:
model = LinearRegressionWithSGD.train(parsedData)
It does not show how to pass categorical features to Linear Regression. I have worked with Random Forest in PySpark before, where I first encoded the categorical features and then passed them to the model, since Random Forest provides a parameter to specify the categorical features. Linear Regression does not show any such parameter in the documentation.
Can anyone show me how to pass categorical features to Linear Regression in PySpark MLlib?
Upvotes: 4
Views: 4193
Reputation: 1747
You need to use a VectorAssembler to build a "features" column. "features" is the default name of the features column, so in the univariate case you can instead pass your own column name, e.g. LinearRegression(featuresCol="catvar"). Here is a walkthrough of the whole process, assuming you start with a string column strVar in a DataFrame df.
Step 1: Build an index that maps to the string variable.
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
varIdxer = StringIndexer(inputCol='strVar',outputCol='varIdx').fit(df)
df = varIdxer.transform(df)
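Step 2: Encode the categorical variable as a sequence of binary variables using a OneHotEncoder.
# In Spark 3.0+ OneHotEncoder is an estimator, so fit it first; on Spark 2.x call .transform(df) directly.
df = OneHotEncoder(inputCol="varIdx", outputCol="varCat").fit(df).transform(df)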
Step 3: Create the "features" column using a VectorAssembler.
assembler = VectorAssembler(inputCols=["varCat"],outputCol="features")
df = assembler.transform(df)
Step 4: Fit the model (I have only tested with LinearRegression).
lr = LinearRegression(labelCol='y', featuresCol='features')
model = lr.fit(df)
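To see what Steps 1 and 2 do to the data without starting Spark, here is a rough plain-Python sketch of the same idea. It is not Spark's implementation: the helper names fit_string_index and one_hot and the sample values are made up for illustration. It assumes the default behaviours, where StringIndexer gives the most frequent label index 0 and OneHotEncoder drops the last category.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent label first.

    Ties are broken alphabetically here; that tie rule is an assumption
    of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(idx, num_categories, drop_last=True):
    """Return the 0/1 vector for one index.

    With drop_last=True the vector has num_categories - 1 slots, so the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if pos == idx else 0.0 for pos in range(size)]


# Hypothetical stand-in for the values of df['strVar'].
str_var = ["red", "blue", "red", "green", "red", "blue"]

index = fit_string_index(str_var)                           # like Step 1
encoded = [one_hot(index[v], len(index)) for v in str_var]  # like Step 2

for value, vector in zip(str_var, encoded):
    print(value, index[value], vector)
```

Dropping the last category keeps the binary columns from summing to one in every row, which would otherwise be collinear with the intercept that a linear regression fits by default.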
Upvotes: 7