SuFi

Reputation: 385

PySpark grouping, joining and converting to JSON

I have two Spark DataFrames.

  1. User Df
    • userId
    • UserName
    • Address
  2. Order Df
    • UserId
    • ProductName
    • ProductDesc
    • CategoryName
    • CategoryId
    • CategoryDesc
    • Price

Sample data: User Df

+------+----+--------+
|userId|name|Addreshh|
+------+----+--------+
|     1|Sufi|   Reons|
|     2|Ragu|  Random|
+------+----+--------+

Order df

+------+-----------+-----------+------------+----------+------------+-----+
|userId|ProductName|ProductDesc|CategoryName|CategoryId|CategoryDesc|Price|
+------+-----------+-----------+------------+----------+------------+-----+
|     1|         A1|      A1Dec|           A|         1|        Adec|    5|
|     1|         A2|      A2Dec|           A|         1|        Adec|   10|
|     1|         B1|      A1Dec|           B|         2|        Bdec|   11|
|     2|         B4|      A4Dec|           B|         2|        Bdec|   15|
+------+-----------+-----------+------------+----------+------------+-----+

I need to group and aggregate the Order DF (to create a nested schema) and join it with the User DF. Then I need to create a JSON file for each record.

e.g. JSON 1:

{
  "userId": 1,
  "name": "Sufi",
  "address": "Reons",
  "order": [
    {
      "name": "A1",
      "price": 5,
      "category": {
        "Id": 1,
        "name": "A",
        "desc": "ADesc"
      }
    },
    {
      "name": "A2",
      "price": 10,
      "category": {
        "Id": 1,
        "name": "A",
        "desc": "ADesc"
      }
    },
    {
      "name": "B1",
      "price": 11,
      "category": {
        "Id": 2,
        "name": "B",
        "desc": "BDesc"
      }
    }
  ]
}

Upvotes: 0

Views: 57

Answers (2)

mck

Reputation: 42342

Join the two dataframes and use collect_list to collect the orders for each user. Write JSON files as output, partitioned by userId. One folder will be created per userId (two in this example), and each folder will contain a single JSON file. Spark can't rename or move the files, so you'll probably need some os operations to rename/move them as you wish.

import pyspark.sql.functions as F

orderdf2 = orderdf.select('userId',
    F.struct(
        F.col('ProductName').alias('name'),
        F.col('Price').alias('price'),
        F.struct(
            F.col('CategoryId').alias('Id'),
            F.col('CategoryName').alias('name'),
            F.col('CategoryDesc').alias('desc')
        ).alias('category')
    ).alias('order')
)

userdf.join(
    orderdf2, 'userId'
).groupBy(
    'userId','name','address'
).agg(
    F.collect_list('order').alias('order')
).write.partitionBy('userId').json('result')
==> userId=1/part-00144-845806db-0700-4585-bb45-01648432abc1.c000.json <==
{"name":"Sufi","address":"Reons","order":[{"name":"A1","price":5,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"Id":"1","name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}

==> userId=2/part-00189-845806db-0700-4585-bb45-01648432abc1.c000.json <==
{"name":"Ragu","address":"Random","order":[{"name":"B4","price":15,"category":{"Id":"2","name":"B","desc":"Bdec"}}]}

Upvotes: 1

stack0114106

Reputation: 8711

Spark-sql solution:

 val df = spark.sql(""" with t1 (
 select  1 c1,   'Sufi' c2, 'Reons' c3  union all
 select  2 c1,   'Ragu' c2, 'Random' c3
  )  select   c1  userId,   c2  name,   c3 Addreshh    from t1
""")

 val order_df = spark.sql(""" with t1 (
 select  1 c1,   'A1' c2, 'A1Dec' c3, 'A' c4, 1 c5,   'Adec' c6, 5 c7    union all
 select  1 c1,   'A2' c2, 'A2Dec' c3, 'A' c4, 1 c5,   'Adec' c6, 10 c7    union all
 select  1 c1,   'B1' c2, 'A1Dec' c3, 'B' c4, 2 c5,   'Bdec' c6, 11 c7    union all
 select  2 c1,   'B4' c2, 'A4Dec' c3, 'B' c4, 2 c5,   'Bdec' c6, 15 c7
  )  select   c1  userId,   c2  ProductName,   c3  ProductDesc,   c4  CategoryName,   c5  CategoryId,   c6  CategoryDesc,   c7 Price    from t1
""")

df.createOrReplaceTempView("cust")
order_df.createOrReplaceTempView("order")

val dj_src1 = spark.sql(""" select userId, collect_list(named_struct('name',ProductName,'price',Price,'category',category )) order from 
( select userId, ProductName, Price, named_struct('id', CategoryId,'name',CategoryName,'desc', CategoryDesc ) category from order ) temp
group by 1  
""")

dj_src1.createOrReplaceTempView("src1")

val dj2 = spark.sql(""" select a.userId, a.name, a.Addreshh, b.order 
from cust a join 
src1 b on
a.userId=b.userid
""")

dj2.toJSON.show(false)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                   |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"userId":1,"name":"Sufi","Addreshh":"Reons","order":[{"name":"A1","price":5,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"A2","price":10,"category":{"id":1,"name":"A","desc":"Adec"}},{"name":"B1","price":11,"category":{"id":2,"name":"B","desc":"Bdec"}}]}|
|{"userId":2,"name":"Ragu","Addreshh":"Random","order":[{"name":"B4","price":15,"category":{"id":2,"name":"B","desc":"Bdec"}}]}                                                                                                                                          |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Upvotes: 0
