mommomonthewind

Reputation: 4640

What is the correct way to use pyspark VectorAssembler?

I am trying to combine all of the feature columns into a single vector column.

So:

assembler = VectorAssembler(
    inputCols=feature_list,
    outputCol='features')

where:

feature_list is a Python list containing all the feature column names.

Then

trainingData = assembler.transform(df)
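
For reference, here is a minimal, self-contained version of what I am running (the column names and values below are made up; my real df has many more feature columns):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for my real df; f1/f2 are placeholder feature columns
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)],
    ["f1", "f2", "label"])

# Every column except the label is treated as a feature
feature_list = [c for c in df.columns if c != "label"]

assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
trainingData = assembler.transform(df)
trainingData.show()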

But when I ran it, something went wrong:

[screenshot of the output/error, not included here]

What is the correct way to use VectorAssembler?

Many thanks

Upvotes: 4

Views: 15017

Answers (1)

pissall

Reputation: 7399

Without the stack trace or a sample of your df, it's hard to tell what the issue is.

Still, here is how it's done, following the documentation:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# In the pyspark shell `spark` already exists; create it here so the snippet runs standalone
spark = SparkSession.builder.getOrCreate()

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

dataset.show()

# +---+----+------+--------------+-------+
# | id|hour|mobile|  userFeatures|clicked|
# +---+----+------+--------------+-------+
# |  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|
# +---+----+------+--------------+-------+

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)

print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")

output.select("features", "clicked").show(truncate=False)

# +-----------------------+-------+
# |features               |clicked|
# +-----------------------+-------+
# |[18.0,1.0,0.0,10.0,0.5]|1.0    |
# +-----------------------+-------+

Example Source Code
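
Once the features column exists, it can be passed to any Spark ML estimator. As a rough sketch (LogisticRegression and the second row of data are only for illustration, not part of the question):

from pyspark.ml.classification import LogisticRegression

# A second row is added so that both label values are present (values are made up)
train = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0),
     (1, 5, 0.0, Vectors.dense([1.0, 2.0, 0.3]), 0.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

# Reuse the same assembler, then fit on the assembled vector column
lr = LogisticRegression(featuresCol="features", labelCol="clicked")
model = lr.fit(assembler.transform(train))

model.transform(assembler.transform(train)) \
    .select("clicked", "prediction").show()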

Upvotes: 9
