Reputation: 370
I'm using the bank data from UCI to template out a project. I was following the PySpark tutorial on their documentation site (sorry, I can't find the link anymore). I keep getting an error when running the pipeline. I've loaded the data, converted the feature types, and set up the pipeline stages for the categorical and numerical features. I'd love feedback on any part of the code, but especially on where the error is coming from so I can continue with this build-out. Thank you in advance!
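For context, the load step looks roughly like this (the file name is a placeholder; adjust sep if you're using the semicolon-delimited UCI export):

# Hypothetical load step -- everything comes in as string; types get cast below
df = spark.read.csv("bank.csv", header=True, inferSchema=False)
df.show(5)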
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| id|age| job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|deposit|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| 1| 59| admin.|married|secondary| no| 2343| yes| no|unknown| 5| may| 1042| 1| -1| 0| unknown| yes|
| 2| 56| admin.|married|secondary| no| 45| no| no|unknown| 5| may| 1467| 1| -1| 0| unknown| yes|
| 3| 41|technician|married|secondary| no| 1270| yes| no|unknown| 5| may| 1389| 1| -1| 0| unknown| yes|
| 4| 55| services|married|secondary| no| 2476| yes| no|unknown| 5| may| 579| 1| -1| 0| unknown| yes|
| 5| 54| admin.|married| tertiary| no| 184| no| no|unknown| 5| may| 673| 2| -1| 0| unknown| yes|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
only showing top 5 rows
# Convert Feature Types
df.createOrReplaceTempView("df")
df2 = spark.sql("""
    select
        cast(id as int) as id,
        cast(age as int) as age,
        cast(job as string) as job,
        cast(marital as string) as marital,
        cast(education as string) as education,
        cast(default as string) as default,
        cast(balance as int) as balance,
        cast(housing as string) as housing,
        cast(loan as string) as loan,
        cast(contact as string) as contact,
        cast(day as int) as day,
        cast(month as string) as month,
        cast(duration as int) as duration,
        cast(campaign as int) as campaign,
        cast(pdays as int) as pdays,
        cast(previous as int) as previous,
        cast(poutcome as string) as poutcome,
        cast(deposit as string) as deposit
    from df
""")
# Data Types
df2.dtypes
[('id', 'int'),
('age', 'int'),
('job', 'string'),
('marital', 'string'),
('education', 'string'),
('default', 'string'),
('balance', 'int'),
('housing', 'string'),
('loan', 'string'),
('contact', 'string'),
('day', 'int'),
('month', 'string'),
('duration', 'int'),
('campaign', 'int'),
('pdays', 'int'),
('previous', 'int'),
('poutcome', 'string'),
('deposit', 'string')]
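The same schema can also be viewed as a tree with:

df2.printSchema()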
# Build Pipeline (Error is Here)
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categorical_cols = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
stages = []
stringIndexer = StringIndexer(inputCol=[cols for cols in categorical_cols],
                              outputCol=[cols + "_index" for cols in categorical_cols])
encoder = OneHotEncoderEstimator(inputCols=[cols + "_index" for cols in categorical_cols],
                                 outputCols=[cols + "_classVec" for cols in categorical_cols])
stages += [stringIndexer, encoder]
label_string_id = StringIndexer(inputCol="deposit", outputCol="label")
stages += [label_string_id]
assembler_inputs = [cols + "_classVec" for cols in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
stages += [assembler]
# Run Data Through Pipeline
pipeline = Pipeline().setStages(stages)
pipeline_model = pipeline.fit(df2)
prepped_df = pipeline_model.transform(df2)
"TypeError: Invalid param value given for param "inputCols". Could not convert job_index to list of strings"
Upvotes: 4
Views: 4663
Reputation: 4151
That's because OneHotEncoderEstimator (unlike the legacy OneHotEncoder) takes multiple columns and yields multiple columns (note that both parameters are plural: Cols, not Col). So you should either wrap each column name in a list and create one encoder per column,
for cols in categorical_cols:
    ...
    encoder = OneHotEncoderEstimator(
        inputCols=[cols + "_index"], outputCols=[cols + "_classVec"]
    )
    ...
or, better, pass all the columns at once, outside the for loop:

encoder = OneHotEncoderEstimator(
    inputCols=[col + "_index" for col in categorical_cols],
    outputCols=[col + "_classVec" for col in categorical_cols]
)
stages += [encoder]
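Putting it all together with your stages list (a sketch, not tested against your exact setup; note that in Spark 2.x StringIndexer still takes a single column, so you need one indexer per categorical column):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

stages = []
# One StringIndexer per categorical column -- inputCol is singular here
for c in categorical_cols:
    stages += [StringIndexer(inputCol=c, outputCol=c + "_index")]
# A single OneHotEncoderEstimator then handles all the indexed columns at once
stages += [OneHotEncoderEstimator(
    inputCols=[c + "_index" for c in categorical_cols],
    outputCols=[c + "_classVec" for c in categorical_cols]
)]
stages += [StringIndexer(inputCol="deposit", outputCol="label")]
stages += [VectorAssembler(
    inputCols=[c + "_classVec" for c in categorical_cols] + numeric_cols,
    outputCol="features"
)]
prepped_df = Pipeline(stages=stages).fit(df2).transform(df2)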
If you're in doubt about the expected input / output, you can always inspect the corresponding Param:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
OneHotEncoderEstimator.inputCols.typeConverter
## <function pyspark.ml.param.TypeConverters.toListString(value)>
StringIndexer.inputCol.typeConverter
## <function pyspark.ml.param.TypeConverters.toString(value)>
As you can see, the former requires an object coercible to a list of strings, while the latter takes just a single string.
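You can also print every param together with its documentation (just a convenience; an empty instance is enough):

from pyspark.ml.feature import OneHotEncoderEstimator
print(OneHotEncoderEstimator().explainParams())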
Upvotes: 7