Afrobeta
Afrobeta

Reputation: 37

Extract Schema from nested Json-String column in Pyspark

Assuming I have the following table:

body
{"Day":1,"vals":[{"id":"1", "val":"3"}, {"id":"2", "val":"4"}]}

My goal is to write down the schema in Pyspark for this nested json column. I've tried the following two things:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  StructType([
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
  ])
  )
])

Here I get the error that of

'StructType' object has no attribute 'name'

Another approach was to declare the nested Arrays as ArrayType:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  ArrayType(
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    , True)
  )
])

Here I get the following error:

takes from 2 to 3 positional arguments but 5 were given

Which propably comes from the array only taking the Sql type as an argument.

Can anybody tell me what their approach would be to create the schema, since I'm a super newbie to the whole topic..

Upvotes: 1

Views: 1051

Answers (1)

Guillermo Garcia
Guillermo Garcia

Reputation: 573

This is the structure you are looking for:

Data = [
    (1, [("1","3"), ("2","4")])
  ]

schema = StructType([
        StructField('Day', IntegerType(), True),
        StructField('vals', ArrayType(StructType([
            StructField('id', StringType(), True),
            StructField('val', StringType(), True)
            ]),True))
         ])
df = spark.createDataFrame(data=Data,schema=schema)
df.printSchema()
df.show(truncate=False)

This will get you the next output:

root
 |-- Day: integer (nullable = true)
 |-- vals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- val: string (nullable = true)

+---+----------------+
|Day|vals            |
+---+----------------+
|1  |[{1, 3}, {1, 3}]|
+---+----------------+

Upvotes: 1

Related Questions