Reputation: 435
I have a DataFrame that I have processed to look like this:
+---------+-------+
| inputs | temp |
+---------+-------+
| [1,0,0] | 12 |
+---------+-------+
| [0,1,0] | 10 |
+---------+-------+
...
inputs is a column of DenseVectors and temp is a column of numeric values. I want to append each temp value to its corresponding DenseVector so that they form a single column, but I am not sure how to start. Any tips for getting this desired output:
+---------------+
| inputsMerged |
+---------------+
| [1,0,0,12] |
+---------------+
| [0,1,0,10] |
+---------------+
...
EDIT: I am trying to use the VectorAssembler method, but the resulting array is not what I intended.
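Roughly what I tried (a sketch; the df and column names follow the table above):
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["inputs", "temp"], outputCol="inputsMerged")
# note: VectorAssembler may return the assembled vector in sparse form
merged = assembler.transform(df).select("inputsMerged")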
Upvotes: 2
Views: 1742
Reputation: 215107
You might do something like this:
df.show()
+-------------+----+
| inputs|temp|
+-------------+----+
|[1.0,0.0,0.0]| 12|
|[0.0,1.0,0.0]| 10|
+-------------+----+
df.printSchema()
root
|-- inputs: vector (nullable = true)
|-- temp: long (nullable = true)
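For reference, a DataFrame like the one above can be built as follows (a minimal sketch, assuming a running SparkSession):
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# two sample rows matching the frame shown above
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0, 0.0]), 12),
     (Vectors.dense([0.0, 1.0, 0.0]), 10)],
    ["inputs", "temp"]
)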
Import:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
Create a udf that merges the vector with the extra element:
concat = F.udf(lambda v, e: Vectors.dense(list(v) + [e]), VectorUDT())
Apply the udf to the inputs and temp columns:
merged_df = df.select(concat(df.inputs, df.temp).alias('inputsMerged'))
merged_df.show()
+------------------+
| inputsMerged|
+------------------+
|[1.0,0.0,0.0,12.0]|
|[0.0,1.0,0.0,10.0]|
+------------------+
merged_df.printSchema()
root
|-- inputsMerged: vector (nullable = true)
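If you want to keep inputs and temp next to the merged column, the same udf works with withColumn (a small variation; df_all is just an illustrative name):
df_all = df.withColumn('inputsMerged', concat(df.inputs, df.temp))
df_all.show()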
Upvotes: 2