Reputation: 55
I have a PySpark dataframe like below.
Time A B C D
06:37:14 2 3 4 5
I want to convert it to the following, for all rows. I don't want to use Pandas to get this done. The new column should be an array (list) type.
Time Features
06:37:14 [2,3,4,5]
How can I do this using PySpark?
Upvotes: 0
Views: 1830
Reputation: 1960
As I described in the comment, when you have a fixed set of columns that you know in advance, you can simply combine the values into a new column with withColumn, and since you want an array, use the array function:
from pyspark.sql.functions import array

df1 = sqlContext.createDataFrame([("06:37:14", '2', '3', '4', '5')], ['Time', 'A', 'B', 'C', 'D'])
df1.withColumn("Features", array("A", "B", "C", "D")).drop("A", "B", "C", "D").show(truncate=False)
Output:
+--------+------------+
|Time |Features |
+--------+------------+
|06:37:14|[2, 3, 4, 5]|
+--------+------------+
Upvotes: 4