Reputation: 441
The following is an example from Databricks, adapted to my own data. I can't get the VectorAssembler
transformation to work.
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
string_indexer = StringIndexer(inputCol='ptype', outputCol='index_ptype', handleInvalid="skip")
string_indexer_model = string_indexer.fit(sample_df)
indexed_df = string_indexer_model.transform(sample_df)
ohe = OneHotEncoder(inputCol='index_ptype', outputCol='ohe_ptype', handleInvalid="keep")
ohe_model = ohe.fit(indexed_df)
ohe_df = ohe_model.transform(indexed_df)
ohe_df.show()
+-----+-----------+-------------+
|ptype|index_ptype| ohe_ptype|
+-----+-----------+-------------+
| 5.0| 2.0|(6,[2],[1.0])|
| 7.0| 4.0|(6,[4],[1.0])|
| 3.0| 1.0|(6,[1],[1.0])|
| 1.0| 0.0|(6,[0],[1.0])|
| 6.0| 3.0|(6,[3],[1.0])|
| 8.0| 5.0|(6,[5],[1.0])|
+-----+-----------+-------------+
assembler = VectorAssembler(inputCols=['ohe_ptype'], outputCol="features")
result_df_dense = assembler.transform(ohe_df)
result_df_dense.show(truncate=False)
+-----+-----------+-------------+-------------+
|ptype|index_ptype|ohe_ptype |features |
+-----+-----------+-------------+-------------+
|5.0 |2.0 |(6,[2],[1.0])|(6,[2],[1.0])|
|7.0 |4.0 |(6,[4],[1.0])|(6,[4],[1.0])|
|3.0 |1.0 |(6,[1],[1.0])|(6,[1],[1.0])|
|1.0 |0.0 |(6,[0],[1.0])|(6,[0],[1.0])|
|6.0 |3.0 |(6,[3],[1.0])|(6,[3],[1.0])|
|8.0 |5.0 |(6,[5],[1.0])|(6,[5],[1.0])|
+-----+-----------+-------------+-------------+
As seen, my features are exactly the same as ohe_ptype!
I expect to get something like this:
+-----+-----------+-------------+-------------------------+
|ptype|index_ptype|ohe_ptype |features |
+-----+-----------+-------------+-------------------------+
|5.0 |2.0 |(6,[2],[1.0])|[0.0,0.0,1.0,0.0,0.0,0.0]|
Upvotes: 0
Views: 15
Reputation: 441
I finally managed to finish the job by converting the column to a TensorFlow tensor.
First, turn it into a NumPy array:
import numpy as np
import tensorflow as tf
feature = np.array(df.select("vector_features_scaled").rdd.map(lambda row: row[0].toArray()).collect())
Then convert it to a tensor, and voilà!
feature = tf.convert_to_tensor(feature, dtype="float32")
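For context, the features column in the question already holds the expected values: (6,[2],[1.0]) is Spark's sparse notation for a vector of size 6 with 1.0 at index 2, and VectorAssembler keeps whichever representation is more compact. The toArray() call above is what expands it to the dense form. Here is a minimal pure-Python sketch of that expansion; the helper name sparse_to_dense is hypothetical and only for illustration, not a Spark API:

```python
def sparse_to_dense(size, indices, values):
    """Expand sparse notation (size, [indices], [values]) into a dense list."""
    dense = [0.0] * size              # every position defaults to 0.0
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the stored entries
    return dense

# First row of the question's output: (6,[2],[1.0])
row = sparse_to_dense(6, [2], [1.0])
print(row)  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

If the dense form is needed as a DataFrame column rather than on the driver, pyspark.ml.functions.vector_to_array (assuming Spark 3.0 or later) is probably the closest built-in equivalent.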
Upvotes: 0