Reputation: 55
I have a PySpark dataframe like below.
Time A B C D
06:37:14 2 3 4 5
I want to convert it to the following, for all rows. I don't want to use Pandas to get this done. The new column should be an array (list) type.
Time Features
06:37:14 [2,3,4,5]
How can I do this using PySpark?
Upvotes: 0
Views: 1830
Reputation: 1960
As I described in the comment, when you have a fixed set of columns that you know in advance, you can simply combine the values into a new column with withColumn, and since you want an array, use the array function:
from pyspark.sql.functions import array

df1 = sqlContext.createDataFrame([("06:37:14", '2', '3', '4', '5')], ['Time', 'A', 'B', 'C', 'D'])
df1.withColumn("Features", array("A", "B", "C", "D")).drop("A", "B", "C", "D").show(truncate=False)
Output:
+--------+------------+
|Time |Features |
+--------+------------+
|06:37:14|[2, 3, 4, 5]|
+--------+------------+
Upvotes: 4