Reputation: 75
I'm working with pyspark and I have the following code that creates a nested json file from a dataframe with some fields (product, quantity, from, to) nested in "requirements". Herunder the code that creates the json an one row as example
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version').agg(F.collect_list(F.struct(F.col("product"), F.col("quantity"), F.col("from"), F.col("to"))).alias('requirements'))
{"identifier":"xxx","plant":"xxxx","family":"xxxx","familyDescription":"xxxx","type":"assembled","name":"xxxx","description":"xxxx","batchSize":20.0,"phantom":"False","makeOrBuy":"make","safetyStock":0.0,"unit":"PZ","unitPrice":xxxx,"version":"0001","requirements":[{"product":"yyyy","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"zzzz","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"kkkk","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"wwww","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"bbbb","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"}]}
The schema of the final2 dataframe is the following:
|-- identifier: string (nullable = true)
|-- plant: string (nullable = true)
|-- family: string (nullable = true)
|-- familyDescription: string (nullable = true)
|-- type: string (nullable = false)
|-- name: string (nullable = true)
|-- description: string (nullable = true)
|-- batchSize: double (nullable = true)
|-- phantom: string (nullable = false)
|-- makeOrBuy: string (nullable = false)
|-- safetyStock: double (nullable = true)
|-- unit: string (nullable = true)
|-- unitPrice: double (nullable = true)
|-- version: string (nullable = true)
|-- requirements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product: string (nullable = true)
| | |-- quantity: double (nullable = true)
| | |-- from: timestamp (nullable = true)
| | |-- to: timestamp (nullable = true)
I'm facing a problem because I have to add to my final dataframe some data with product, quantity, from, to = Null: using the code above I get "requirements":[{}] , but the DB where I write the file (MongoDB) get an error with the empty JSON object because it expects "requirements":[] for null values.
I've tried with
import pyspark.sql.functions as F
df = final_prova2.withColumn("requirements",
F.when(final_prova2.requirements.isNull(),
F.array()).otherwise(final_prova2.requirements))
but it doesn't work. Any suggestion on how modify the code? I'm struggling to find a solution (I don't even know if a solution is possible considering the structure required).
Thanks
Upvotes: 0
Views: 427
Reputation: 13998
You need to check if all 4 fields of requirements
are NULL, not the column itself. One way you can fix this is to adjust the collect_list aggregate function when creating final2:
import pyspark.sql.functions as F
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version') \
.agg(F.expr("""
collect_list(
IF(coalesce(quantity, product, from, to) is NULL
, NULL
, struct(product, quantity, from, to)
)
)
""").alias('requirements'))
Where:
we use an SQL expression IF(condition, true_value, false_value)
to set up the argument for collect_list
the condition: coalesce(quantity, product, from, to) is NULL
is to test if all listed 4 columns are NULL, if it's true, return NULL, otherwise return struct(product, quantity, from, to)
Upvotes: 1