Lakkhichhara
Lakkhichhara

Reputation: 75

json file nested column value is coming as null in pyspark

json file: "products":[{"productId":"8d809e13-fdc5-4d15-9271- 953750f6d592","quantity":500,"soldPrice":104.852}, {"productId":"ec15ba1d-53b6-44b0-8a22- 1e498485f1b8","quantity":300,"soldPrice":94.3668}, {"productId":"e672483e-57a8-434a-bc42- ecf827c8a8d4","quantity":1000,"soldPrice":109.57034}], "shippingAddress":{"attention":"Khaleesi Frost","address":"493 Augustine Drive N ","city":"Miramar","state":"FL","zip":"33785"}

I can access the 'products' using explode method as it's ArrayType. But the 'shippingAddress' values are coming as null.

`select(F.col("shippingAddress.attention").alias("shipping_address_attention"),
   F.col("shippingAddress.address").alias("shipping_address_address"))`

json Schema: StructField('shippingAddress' , StructType([ StructField('attention' , StringType(), True), StructField('address' , StringType(), True), StructField('city' , StringType(), True), StructField('state' , StringType(), True), StructField('zip' , IntegerType(), True) ]))

Upvotes: 0

Views: 1334

Answers (1)

Mohana B C
Mohana B C

Reputation: 5487

Check if you have a valid JSON here.

If you are using same JSON as you posted in your question then you should enclose whole JSON within {}. While reading JSON use option multiline as True.

>>> from pyspark.sql.functions import *
>>> df = spark.read.option('multiline', True).json('file.json')
>>> df.printSchema()
root
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productId: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |    |    |-- soldPrice: double (nullable = true)
 |-- shippingAddress: struct (nullable = true)
 |    |-- address: string (nullable = true)
 |    |-- attention: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- zip: string (nullable = true)

>>> df.select(col("shippingAddress.attention").alias("shipping_address_attention"),col("shippingAddress.address").alias("shipping_address_address")).show(truncate=False)
+--------------------------+------------------------+
|shipping_address_attention|shipping_address_address|
+--------------------------+------------------------+
|Khaleesi Frost            |493 Augustine Drive N   |
+--------------------------+------------------------+

UPDATE

You have mentioned zip column type as IntegerType() but in JSON it's string, due to this type mismatch you are getting Null in the column. so change type to StringType() in your custom schema.

Your schema should be,

StructType([StructField('shippingAddress' , StructType([ StructField('attention' , StringType(), True), StructField('address' , StringType(), True),                                 
StructField('city' , StringType(), True), StructField('state' , StringType(), True), StructField('zip' , StringType(), True) ]))])

Upvotes: 2

Related Questions