Code Junkie

Reputation: 1459

Why do several datasets have an Array of Structs in Apache Spark?

I see that several datasets have a column that is an array of structs rather than an array of strings or integers.

 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: string (nullable = true)

I was wondering why. Ultimately, all I want is to represent an array of strings, so why have a struct in between?

Upvotes: 0

Views: 68

Answers (1)

Salim

Reputation: 2178

You can hold an array of strings with an ArrayType inside a StructField; you don't need a StructType as the element type. In the example below, column2 holds an array of strings (see the schema for "column2"). The schema for the whole row is still a StructType, though.

import org.apache.spark.sql.types._

StructType(
  Array(
    StructField("column1", LongType, nullable = true),
    // column2 is a plain array of strings, with no struct in between
    StructField("column2", ArrayType(StringType, true), nullable = true)
  )
)
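To connect this back to the printSchema output in the question, here is a small sketch that builds the same schema and prints its tree form. The object name SchemaSketch is made up for illustration; StructType.treeString is the method that produces the same layout as printSchema, and it does not need a running SparkSession.

```scala
import org.apache.spark.sql.types._

// Hypothetical wrapper object, only for illustration.
object SchemaSketch {
  // Same schema as above: column2 is an array of strings, no struct in between.
  val schema: StructType = StructType(
    Array(
      StructField("column1", LongType, nullable = true),
      StructField("column2", ArrayType(StringType, containsNull = true), nullable = true)
    )
  )

  def main(args: Array[String]): Unit = {
    // Prints the schema tree; column2's element is reported as a string, not a struct.
    println(schema.treeString)
  }
}
```

The entry for column2 should show an element of type string, whereas the schema in the question shows an element of type struct.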

You need a StructType as the element type only when each element is a complex value made up of several fields, possibly of different data types. It is like holding a table within a column. In the schema below, each element of "column2" is a struct with two string fields.

StructType(
  Array(
    StructField("column1", LongType, nullable = true),
    StructField("column2", ArrayType(
      // each array element is a struct with two string fields
      StructType(Array(
        StructField("column3", StringType, nullable = true),
        StructField("column4", StringType, nullable = true)
      )),
      true
    ))
  )
)
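If you already have data shaped like the schema in the question and only want an array of strings, one option is to select the struct field through the array. This is a minimal sketch, not the only way to do it: the case classes Name and Record, the object FlattenStructArray and the helper flattenNames are hypothetical names that mirror the question's schema. It relies on Spark returning an array of the field values when you select a field of an array-of-struct column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical case classes mirroring the schema in the question:
// name: array<struct<value: string>>
case class Name(value: String)
case class Record(name: Seq[Name])

object FlattenStructArray {
  // Selecting "name.value" on an array of structs yields an array of the
  // "value" fields, i.e. array<string>.
  def flattenNames(df: DataFrame): DataFrame =
    df.select(col("name.value").as("name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // A Seq of case classes is encoded as an array of structs.
    val df = Seq(Record(Seq(Name("a"), Name("b")))).toDF()
    df.printSchema()               // name is array<struct<value:string>>

    flattenNames(df).printSchema() // name is array<string>

    spark.stop()
  }
}
```

The sketch also hints at one common reason such schemas appear: a collection of case classes (or, similarly, a JSON array of objects) is encoded as an array of structs even when each struct has a single field.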

Upvotes: 1
