Reputation: 1103
I am working with a DataFrame that has three columns: colA, colB, and colC.
+---+-----+-----+-----+
|id |colA |colB |colC |
+---+-----+-----+-----+
| 1 | 5 | 8 | 3 |
| 2 | 9 | 7 | 4 |
| 3 | 3 | 0 | 6 |
| 4 | 1 | 6 | 7 |
+---+-----+-----+-----+
I need to merge the colA, colB, and colC columns to get a new DataFrame like the one below:
+---+--------------+
|id | colD |
+---+--------------+
| 1 | [5, 8, 3] |
| 2 | [9, 7, 4] |
| 3 | [3, 0, 6] |
| 4 | [1, 6, 7] |
+---+--------------+
This is the PySpark code that produces the first DataFrame:
l=[(1,5,8,3),(2,9,7,4), (3,3,0,6), (4,1,6,7)]
names=["id","colA","colB","colC"]
db=sqlContext.createDataFrame(l,names)
db.show()
How do I convert the rows to vectors? Could anyone help me, please? Thanks.
Upvotes: 2
Views: 3952
Reputation: 2590
It actually depends slightly on what data type you want for colD. If you want a VectorUDT column, then using the VectorAssembler is the correct transformation. If you just want the fields combined into an array, then a UDF is unnecessary; you can use the built-in array function to combine the columns:
>>> from pyspark.sql.functions import array
>>> db.select('id',array('colA','colB','colC').alias('colD')).show()
+---+---------+
| id| colD|
+---+---------+
| 1|[5, 8, 3]|
| 2|[9, 7, 4]|
| 3|[3, 0, 6]|
| 4|[1, 6, 7]|
+---+---------+
This will actually give a performance boost over the UDF-based transformations, because PySpark doesn't have to serialize the data to and from a Python UDF.
Upvotes: 1
Reputation: 5880
You can use VectorAssembler from pyspark.ml:
from pyspark.ml.feature import VectorAssembler
newdb = VectorAssembler(inputCols=["colA", "colB", "colC"], outputCol="colD").transform(db)
newdb.show()
+---+----+----+----+-------------+
| id|colA|colB|colC| colD|
+---+----+----+----+-------------+
| 1| 5| 8| 3|[5.0,8.0,3.0]|
| 2| 9| 7| 4|[9.0,7.0,4.0]|
| 3| 3| 0| 6|[3.0,0.0,6.0]|
| 4| 1| 6| 7|[1.0,6.0,7.0]|
+---+----+----+----+-------------+
Or, if you prefer, you can use a UDF to do the row-wise composition:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

udf1 = F.udf(lambda x, y, z: [x, y, z], ArrayType(IntegerType()))
db.select("id", udf1("colA", "colB", "colC").alias("colD")).show()
+---+---------+
| id| colD|
+---+---------+
| 1|[5, 8, 3]|
| 2|[9, 7, 4]|
| 3|[3, 0, 6]|
| 4|[1, 6, 7]|
+---+---------+
Hope this helps!
Upvotes: 2