I'm learning PySpark, and I used it to read a CSV file into a DataFrame:
>>> df = spark.read.option("header",True).csv('example.csv')
>>> df.show(n=4)
+-------+------+------+
|main_id| id| price|
+-------+------+------+
| 100|aaaaa1|190000|
| 101| bbbbb|216000|
| 100|aaaaa2|276000|
| 100|aaaaa3|340000|
+-------+------+------+
only showing top 4 rows
How can I group the data by the first column main_id, collect the id and price columns into an array, and then convert the dataframe to newline-delimited JSON? Something like:
{"main_id": "100", "items": [{"id": "aaaaa1", "price": 190000},{"id": "aaaaa2", "price": 276000},{"id": "aaaaa3", "price": 340000}]}
{"main_id": "101", "id": "bbbbb", "price": 216000}
...
...
You can use struct and groupby with collect_list:
from pyspark.sql import functions as f

# Pack id and price into one struct per row, then collect those
# structs into an array for each main_id
df.select(f.col("main_id"), f.struct(f.col("id"), f.col("price")).alias("items")) \
    .groupby("main_id") \
    .agg(f.collect_list("items").alias("items"))
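If you assign the aggregation to a variable (result is just an illustrative name here), you can confirm the shape with printSchema before serializing:

result = (
    df.select("main_id", f.struct("id", "price").alias("items"))
      .groupby("main_id")
      .agg(f.collect_list("items").alias("items"))
)
result.printSchema()  # main_id plus items: array<struct<id,price>>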
Or:
from pyspark.sql import functions as f

# Collect each column into its own array per group, then zip the two
# arrays into a single array of {id, price} structs
df.groupby("main_id") \
    .agg(f.collect_list("id").alias("id"), f.collect_list("price").alias("price")) \
    .select("main_id", f.arrays_zip(f.col("id"), f.col("price")).alias("items"))
Output:
{"main_id":100,"items":[{"id":"aaaaa1","col2":{"price":190000}},{"id":"aaaaa2","col2":{"price":276000}},{"id":"aaaaa3","col2":{"price":340000}}]}
{"main_id":101,"items":[{"id":"bbbbb","col2":{"price":216000}}]}