Reputation: 99
I have a PySpark DataFrame that I want to group by some index columns, combining all the values in each remaining column into one list per column.
Example input:
id_1| id_2| id_3|timestamp|thing1|thing2|thing3
A | b | c |time_0 |1.2 |1.3 |2.5
A | b | c |time_1 |1.1 |1.5 |3.4
A | b | c |time_2 |2.2 |2.6 |2.9
A | b | d |time_0 |5.1 |5.5 |5.7
A | b | d |time_1 |6.1 |6.2 |6.3
A | b | e |time_0 |0.1 |0.5 |0.9
A | b | e |time_1 |0.2 |0.3 |0.6
Example output:
id_1|id_2|id_3| timestamp |thing1 |thing2 |thing3
A |b | c |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6]|[2.5,3.4,2.9]
A |b | d |[time_0,time_1] |[5.1,6.1] |[5.5,6.2] |[5.7,6.3]
A |b | e |[time_0,time_1] |[0.1,0.2] |[0.5,0.3] |[0.9,0.6]
How can I do this efficiently?
Upvotes: 0
Views: 1361
Reputation: 7585
Use collect_list(), as others have suggested.
# Creating the DataFrame
df = spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
     ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9),
     ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
     ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3),
     ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
     ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6)],
    ['id_1', 'id_2', 'id_3', 'timestamp', 'thing1', 'thing2', 'thing3'])
df.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
| A| b| c| time_2| 2.2| 2.6| 2.9|
| A| b| d| time_0| 5.1| 5.5| 5.7|
| A| b| d| time_1| 6.1| 6.2| 6.3|
| A| b| e| time_0| 0.1| 0.5| 0.9|
| A| b| e| time_1| 0.2| 0.3| 0.6|
+----+----+----+---------+------+------+------+
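With the DataFrame API, groupBy() followed by agg() and collect_list() does the aggregation directly. Here is a minimal sketch (the .alias() calls just keep the original column names):
from pyspark.sql.functions import collect_list

# Group by the id columns and collect each remaining column into a list.
agg_df = df.groupBy("id_1", "id_2", "id_3") \
           .agg(collect_list("timestamp").alias("timestamp"),
                collect_list("thing1").alias("thing1"),
                collect_list("thing2").alias("thing2"),
                collect_list("thing3").alias("thing3"))
agg_df.show(truncate=False)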
In addition to using agg(), you can write familiar SQL syntax to operate on it, but first we have to register the DataFrame as a temporary SQL view:
df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
collect_list(timestamp) as timestamp,
collect_list(thing1) as thing1,
collect_list(thing2) as thing2,
collect_list(thing3) as thing3
from df_view
group by id_1, id_2, id_3""")
df.show(truncate=False)
+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp |thing1 |thing2 |thing3 |
+----+----+----+------------------------+---------------+---------------+---------------+
|A |b |d |[time_0, time_1] |[5.1, 6.1] |[5.5, 6.2] |[5.7, 6.3] |
|A |b |e |[time_0, time_1] |[0.1, 0.2] |[0.5, 0.3] |[0.9, 0.6] |
|A |b |c |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+
Note: The triple quotes (""") are used so the query can span multiple lines for readability; a plain single-quoted string like 'select id_1 ...' cannot be spread over multiple lines that way. Either way, the final result is the same.
Upvotes: 2
Reputation: 2685
Here is an example (TestExample1 on GitHub):
# The example comes from a test class, so self.spark is the test's SparkSession.
from pyspark.sql.functions import col, collect_list

exampleDf = self.spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4)],
    ("id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"))
exampleDf.show()

# Group by the id columns and collect the remaining columns into lists.
ans = exampleDf.groupBy(col("id_1"), col("id_2"), col("id_3")) \
    .agg(collect_list(col("timestamp")),
         collect_list(col("thing1")),
         collect_list(col("thing2")))
ans.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
+----+----+----+---------+------+------+------+
+----+----+----+-----------------------+--------------------+--------------------+
|id_1|id_2|id_3|collect_list(timestamp)|collect_list(thing1)|collect_list(thing2)|
+----+----+----+-----------------------+--------------------+--------------------+
| A| b| c| [time_0, time_1]| [1.2, 1.1]| [1.3, 1.5]|
+----+----+----+-----------------------+--------------------+--------------------+
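Note that the aggregated columns come back with default names like collect_list(timestamp); if you want to keep the original names (and also collect thing3), each aggregate can be wrapped with .alias(), e.g. collect_list(col("thing3")).alias("thing3").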
Upvotes: 0