btt3165
btt3165

Reputation: 41

pyspark - Generate json from grouped data

I am trying to group data from df and generate json object out of the grouped data. I thought I was on the right track but looking at the result I do not think I am generating a proper json object.

Source data

 df1 = sqlContext.createDataFrame([
      ("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_04_20210815.DAT', '0.057','PRODUCT'), 
      ("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_00_20210815.DAT', '0.068','PRODUCT'), 
      ("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_01_20210815.DAT', '0.089','PRODUCT'), 
      ("20210924155828",'2021-09-17 13:55:35','2021-08-15 00:00:00','PRODUCT_03_20210815.DAT', '0.100','PRODUCT')
      ], ["ID", "FileReceivedTimestamp", "SourceDataTimestamp","SourceFileName", "FileSize","FileName"])

    df1.show()
    +--------------+---------------------+-------------------+--------------------+--------+--------+
    |            ID|FileReceivedTimestamp|SourceDataTimestamp|      SourceFileName|FileSize|FileName|
    +--------------+---------------------+-------------------+--------------------+--------+--------+
    |20210924155828|  2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_04_202108...|   0.057| PRODUCT|
    |20210924155828|  2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_00_202108...|   0.068| PRODUCT|
    |20210924155828|  2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_01_202108...|   0.089| PRODUCT|
    |20210924155828|  2021-09-17 13:55:35|2021-08-15 00:00:00|PRODUCT_03_202108...|   0.100| PRODUCT|
    +--------------+---------------------+-------------------+--------------------+--------+--------+

utilized to_json, groupBy, agg and collect_list to generate json object.

df2 = (df1.select("FileName","ID",to_json(struct("SourceFileName","FileReceivedTimestamp","FileSize")).alias("metadata"))
   .groupby("FileName","ID").agg(collect_list(col("metadata")).alias("jsonmetadata")))

I see below using display in databricks

enter image description here

to query and test my output i did a explode on the generated json. I get this error when trying to view SourceFileName "Can't extract value from MetadataArray#779: need struct type but got string;"

display(df2.select(explode(col("jsonmetadata")).alias("MetadataArray")
                  ,col("MetadataArray.SourceFileName").alias("SourceFileName")
                  ))

Am I missing something here.... was hoping to query the data after exploding using MetadataArray.SourceFileName, MetadataArray.FileSize etc...

Edit:
My expectation was to get jsonmetadata as array of struct instead of array of string. removing to_json helped resolve my issue.

df2 = (df1.select("FileName","ID",(struct("SourceFileName","FileReceivedTimestamp","FileSize")).alias("metadata"))
   .groupby("FileName","ID").agg(collect_list(col("metadata")).alias("jsonmetadata")))

Schema after removing to_json conversion.

root |-- FileName: string (nullable = true) |-- ID: string (nullable = true) |-- jsonmetadata: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- SourceFileName: string (nullable = true) | | |-- FileReceivedTimestamp: string (nullable = true) | | |-- FileSize: string (nullable = true)

enter image description here

Upvotes: 1

Views: 160

Answers (1)

AdibP
AdibP

Reputation: 2939

Since MetadataArray values are json strings, you need to parse them using from_json function with the correct schema

from pyspark.sql.functions import col, explode, from_json

sch = "SourceFileName string, FileReceivedTimestamp string, FileSize string"

df2\
.select(explode(col("jsonmetadata")).alias("MetadataArray"), 
        from_json(col("MetadataArray"), sch).getField("SourceFileName").alias("SourceFileName"))\
.show()

# +--------------------+--------------------+
# |       MetadataArray|      SourceFileName|
# +--------------------+--------------------+
# |{"SourceFileName"...|PRODUCT_04_202108...|
# |{"SourceFileName"...|PRODUCT_00_202108...|
# |{"SourceFileName"...|PRODUCT_01_202108...|
# |{"SourceFileName"...|PRODUCT_03_202108...|
# +--------------------+--------------------+

Upvotes: 2

Related Questions