Jonathan Roy

Reputation: 441

Why Is There No Change Using The pyspark.ml Feature VectorAssembler?

The following is an example from Databricks, adapted to my own data, and I can't get the VectorAssembler transformation to work.

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Index the categorical column, then one-hot encode the indices
string_indexer = StringIndexer(inputCol='ptype', outputCol='index_ptype', handleInvalid="skip")
string_indexer_model = string_indexer.fit(sample_df)
indexed_df = string_indexer_model.transform(sample_df)

ohe = OneHotEncoder(inputCol='index_ptype', outputCol='ohe_ptype', handleInvalid="keep")
ohe_model = ohe.fit(indexed_df)
ohe_df = ohe_model.transform(indexed_df)
ohe_df.show()

+-----+-----------+-------------+
|ptype|index_ptype|    ohe_ptype|
+-----+-----------+-------------+
|  5.0|        2.0|(6,[2],[1.0])|
|  7.0|        4.0|(6,[4],[1.0])|
|  3.0|        1.0|(6,[1],[1.0])|
|  1.0|        0.0|(6,[0],[1.0])|
|  6.0|        3.0|(6,[3],[1.0])|
|  8.0|        5.0|(6,[5],[1.0])|
+-----+-----------+-------------+

assembler = VectorAssembler(inputCols=['ohe_ptype'], outputCol="features")
result_df_dense = assembler.transform(ohe_df)

result_df_dense.show(truncate=False)

+-----+-----------+-------------+-------------+
|ptype|index_ptype|ohe_ptype    |features     |
+-----+-----------+-------------+-------------+
|5.0  |2.0        |(6,[2],[1.0])|(6,[2],[1.0])|
|7.0  |4.0        |(6,[4],[1.0])|(6,[4],[1.0])|
|3.0  |1.0        |(6,[1],[1.0])|(6,[1],[1.0])|
|1.0  |0.0        |(6,[0],[1.0])|(6,[0],[1.0])|
|6.0  |3.0        |(6,[3],[1.0])|(6,[3],[1.0])|
|8.0  |5.0        |(6,[5],[1.0])|(6,[5],[1.0])|
+-----+-----------+-------------+-------------+

As you can see, the features column is exactly the same as ohe_ptype!

I expect to get something like this:

+-----+-----------+-------------+-------------------------+
|ptype|index_ptype|ohe_ptype    |features                 |
+-----+-----------+-------------+-------------------------+
|5.0  |2.0        |(6,[2],[1.0])|[0.0,0.0,1.0,0.0,0.0,0.0]|
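
For reference, if I build the sparse vector by hand with pyspark.ml.linalg, toArray() does give me the dense layout I am after, so I assume both notations describe the same 6-element vector; this is just a small sketch outside the DataFrame:

from pyspark.ml.linalg import SparseVector

# Sparse notation: (size, [non-zero indices], [non-zero values])
v = SparseVector(6, [2], [1.0])

print(v)            # (6,[2],[1.0])
print(v.toArray())  # [0. 0. 1. 0. 0. 0.]  <- the dense layout I expected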

Upvotes: 0

Views: 15

Answers (1)

Jonathan Roy

Reputation: 441

I finally managed to use a TensorFlow transformation to finish the job.

First, turn it into a NumPy array:

import numpy as np
feature = np.array(df.select("vector_features_scaled").rdd.map(lambda row: row[0].toArray()).collect())

Then turn it into a tensor, and voilà!

import tensorflow as tf
feature = tf.convert_to_tensor(feature, dtype="float32")
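
For completeness, here is the whole conversion as one self-contained sketch. It assumes, as in the snippets above, that df is the DataFrame coming out of the Spark pipeline and that the assembled/scaled vectors live in a column called vector_features_scaled; keep in mind that collect() pulls the whole column onto the driver, so this only works for data that fits in memory:

import numpy as np
import tensorflow as tf

# Expand each (possibly sparse) Spark vector into a dense NumPy row,
# then stack the rows into a 2-D array on the driver.
rows = df.select("vector_features_scaled").rdd.map(lambda row: row[0].toArray())
feature = np.array(rows.collect())

# Hand the dense matrix to TensorFlow as a float32 tensor.
feature = tf.convert_to_tensor(feature, dtype="float32")
print(feature.shape)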

Upvotes: 0
