My problem is based on the similar question PySpark: Add a new column with a tuple created from columns, with the difference that I have a list of values in each column instead of a single value. For example:
from pyspark.sql import Row

df = sqlContext.createDataFrame([
    Row(v1=[u'2.0', u'1.0', u'9.0'], v2=[u'9.0', u'7.0', u'2.0']),
    Row(v1=[u'4.0', u'8.0', u'9.0'], v2=[u'1.0', u'1.0', u'2.0'])])
+---------------+---------------+
| v1| v2|
+---------------+---------------+
|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|
|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|
+---------------+---------------+
What I am trying to get is something like an element-wise zip of the lists per row, but I can't figure out how to do it in PySpark 1.6:
+---------------+---------------+--------------------+
| v1| v2| v_tuple|
+---------------+---------------+--------------------+
|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|[(2.0,9.0), (1.0,...|
|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|[(4.0,1.0), (8.0,...|
+---------------+---------------+--------------------+
Note: The size of the arrays may vary from row to row, but within a given row both columns always have the same length.
Upvotes: 2
Views: 7275
If the size of the arrays varies from row to row, you'll need a UDF. In Spark 2.x:
from pyspark.sql.functions import udf

@udf("array<struct<_1:double,_2:double>>")
def zip_(xs, ys):
    # the example arrays hold strings, so cast to float
    # to match the declared double struct fields
    return list(zip(map(float, xs), map(float, ys)))

df.withColumn("v_tuple", zip_("v1", "v2"))
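For completeness: on Spark 2.4+ (not an option in the asker's 1.6), the built-in arrays_zip does this without a UDF, producing an array of structs keyed by the source column names:

from pyspark.sql.functions import arrays_zip

# zips v1 and v2 element-wise into array<struct<v1:..., v2:...>>
df.withColumn("v_tuple", arrays_zip("v1", "v2"))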
In Spark 1.6:
from pyspark.sql.types import *

zip_ = udf(
    lambda xs, ys: list(zip(map(float, xs), map(float, ys))),
    ArrayType(StructType([StructField("_1", DoubleType()),
                          StructField("_2", DoubleType())])))
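The 1.6 version is applied the same way as above:

df.withColumn("v_tuple", zip_("v1", "v2")).show()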
Upvotes: 3