Reputation: 1103
I am working with a DataFrame that has three columns: colA, colB, and colC.
+---+-----+-----+-----+
|id |colA |colB |colC |
+---+-----+-----+-----+
| 1 | 5 | 8 | 3 |
| 2 | 9 | 7 | 4 |
| 3 | 3 | 0 | 6 |
| 4 | 1 | 6 | 7 |
+---+-----+-----+-----+
I need to merge the colA, colB, and colC columns to get a new DataFrame like the one below:
+---+--------------+
|id | colD |
+---+--------------+
| 1 | [5, 8, 3] |
| 2 | [9, 7, 4] |
| 3 | [3, 0, 6] |
| 4 | [1, 6, 7] |
+---+--------------+
This is the PySpark code that produces the first DataFrame:
l=[(1,5,8,3),(2,9,7,4), (3,3,0,6), (4,1,6,7)]
names=["id","colA","colB","colC"]
db=sqlContext.createDataFrame(l,names)
db.show()
How do I convert the rows to vectors? Could anyone help me, please? Thanks.
Upvotes: 2
Views: 3952
Reputation: 2590
It actually depends slightly on what data type you want for colD. If you want a VectorUDT column, then using the VectorAssembler is the correct transformation. If you just want the fields combined into an array, then a UDF is unnecessary; you can use the built-in array function to combine the columns:
>>> from pyspark.sql.functions import array
>>> db.select('id',array('colA','colB','colC').alias('colD')).show()
+---+---------+
| id| colD|
+---+---------+
| 1|[5, 8, 3]|
| 2|[9, 7, 4]|
| 3|[3, 0, 6]|
| 4|[1, 6, 7]|
+---+---------+
This will actually give a performance boost over the UDF-based transformations, because PySpark doesn't have to serialize the data to and from a Python UDF.
Upvotes: 1
Reputation: 5880
You can use VectorAssembler from pyspark.ml:
from pyspark.ml.feature import VectorAssembler
newdb = VectorAssembler(inputCols=["colA", "colB", "colC"], outputCol="colD").transform(db)
newdb.show()
+---+----+----+----+-------------+
| id|colA|colB|colC| colD|
+---+----+----+----+-------------+
| 1| 5| 8| 3|[5.0,8.0,3.0]|
| 2| 9| 7| 4|[9.0,7.0,4.0]|
| 3| 3| 0| 6|[3.0,0.0,6.0]|
| 4| 1| 6| 7|[1.0,6.0,7.0]|
+---+----+----+----+-------------+
Or, if you prefer, you can use a UDF to do the row-wise composition:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

udf1 = F.udf(lambda x, y, z: [x, y, z], ArrayType(IntegerType()))
db.select("id", udf1("colA", "colB", "colC").alias("colD")).show()
+---+---------+
| id| colD|
+---+---------+
| 1|[5, 8, 3]|
| 2|[9, 7, 4]|
| 3|[3, 0, 6]|
| 4|[1, 6, 7]|
+---+---------+
Hope this helps!
Upvotes: 2