Jason Donnald

Reputation: 2316

How to pass categorical features to Linear Regression modeling in PySpark MLlib?

I am working on linear regression modeling in PySpark and have a question about it. My data has categorical features. I went through the PySpark documentation, and the example for linear regression shows this:

model = LinearRegressionWithSGD.train(parsedData)

It does not show how to pass categorical features to linear regression. I have worked with Random Forest in PySpark before, where I first encoded the categorical features and then passed them to the model, since Random Forest provides a parameter for specifying which features are categorical. The linear regression documentation does not show any such parameter.

Can anyone show me how to pass categorical features to linear regression in PySpark MLlib?

Upvotes: 4

Views: 4193

Answers (1)

groceryheist

Reputation: 1747

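You need to use a VectorAssembler to build a "features" column. "features" is the default name of the features column, so in the univariate case you can instead pass the column name directly, e.g. LinearRegression(featuresCol="catvar"). Note that this uses the DataFrame-based pyspark.ml API rather than the RDD-based LinearRegressionWithSGD from the question. Here is a walkthrough of the whole process, assuming you start with a string column strVar in a DataFrame df.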

Step 1: Build an index that maps each distinct value of the string variable to a numeric index.

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression  # needed for Step 4

varIdxer = StringIndexer(inputCol='strVar',outputCol='varIdx').fit(df)
df = varIdxer.transform(df)
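
To make the indexing step concrete, here is a minimal pure-Python sketch of the mapping a StringIndexer builds with its default ordering (most frequent label gets index 0.0). The helper name fit_string_index and the sample values are hypothetical, and the alphabetical tie-break is an assumption made here for determinism; this is an illustration, not Spark's implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption made in this sketch so the result is deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample data standing in for the strVar column.
str_var = ["red", "blue", "red", "green", "red", "blue"]
index = fit_string_index(str_var)
var_idx = [index[v] for v in str_var]
print(index)
print(var_idx)
```

The indices are only labels; feeding them to a linear model directly would impose an artificial ordering on the categories, which is why the next step encodes them as binary variables.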

Step 2: Encode the categorical variable as a sequence of binary variables using a OneHotEncoder

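# In Spark 3.0+ OneHotEncoder is an estimator and must be fit first; in Spark 2.x call .transform(df) directly.
df = OneHotEncoder(inputCol="varIdx", outputCol="varCat").fit(df).transform(df)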
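
As a rough illustration of what the encoder produces, the sketch below turns category indices into binary vectors in plain Python. It assumes the encoder's default of dropping the last category (dropLast=True), so three categories yield vectors of length two and the last category is the all-zeros vector. The function name one_hot is hypothetical, and Spark stores the result as a sparse vector rather than a list.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a binary vector with a 1.0 at the position of `index`."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    # With drop_last, the final category has no slot and stays all zeros.
    if index < size:
        vec[int(index)] = 1.0
    return vec

# Hypothetical indices, as produced by the indexing step.
var_idx = [0.0, 1.0, 0.0, 2.0]
var_cat = [one_hot(i, num_categories=3) for i in var_idx]
print(var_cat)
```

Dropping one category keeps the encoded columns from summing to one in every row, which would otherwise be collinear with the model's intercept.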

Step 3: Create the "features" col using a VectorAssembler.

assembler = VectorAssembler(inputCols=["varCat"], outputCol="features")
df = assembler.transform(df)
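
If you also have numeric predictors, you would list them in inputCols next to the encoded column. The pure-Python sketch below shows the idea of what the assembler does per row: it concatenates scalar and vector columns, in order, into one flat feature vector. The column name numVar and the helper assemble are hypothetical, used only for illustration.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of a row into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-valued column (e.g. a one-hot encoding): append each entry.
            features.extend(value)
        else:
            # Scalar numeric column: append as a single feature.
            features.append(float(value))
    return features

# Hypothetical row with an encoded categorical column and a numeric column.
row = {"varCat": [0.0, 1.0], "numVar": 3.5}
features = assemble(row, ["varCat", "numVar"])
print(features)
```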

Step 4: Fit the model (I have only tested with LinearRegression).

lr = LinearRegression(labelCol='y', featuresCol='features')
model = lr.fit(df)  # assumes df has a numeric label column named 'y'

Upvotes: 7
