Philip

Reputation: 101

Querying struct within array - Databricks SQL

I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.

I have the following table:

id | array
1  | [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
2  | [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]

In a different SQL editor, I was able to flatten the array and select each struct's firstName with the following query:

SELECT
  id,
  struct.firstName
FROM
  table
  CROSS JOIN UNNEST(array) AS t(struct)

With a resulting table of:

id | firstName
1  | John
1  | Jane
2  | Bob
2  | Betty

Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error:

[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.

I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.

Upvotes: 6

Views: 12255

Answers (1)

Bartosz Gajda

Reputation: 1167

1. SQL API

The first solution uses the SQL API. The snippet below only prepares the test data, so skip it if you already have the table in place.

from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType

# Schema: an integer id plus an array of structs with three string fields.
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('people', ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])

sql_df = spark.createDataFrame([
    (1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
    (2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)

# Register the DataFrame as a temp view so it can be queried with SQL.
sql_df.createOrReplaceTempView("sql_df")
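
You can confirm the nested layout with printSchema (its output is shown here as comments):

sql_df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- people: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- firstName: string (nullable = true)
#  |    |    |-- lastName: string (nullable = true)
#  |    |    |-- age: string (nullable = true)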

What you need to use is the LATERAL VIEW clause (docs), which lets you explode the nested array into one row per element, like this:

SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) t AS exploded;

+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+
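
If you need more than one field from each struct, a small variation is to keep the whole exploded struct and select several of its fields at once. A minimal sketch against the same sql_df view (t and person are just illustrative aliases):

-- One output row per struct; each named field is then addressable.
SELECT id, person.firstName, person.lastName, person.age
FROM sql_df
LATERAL VIEW EXPLODE(people) t AS person;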

2. DataFrame API

The alternative approach is to use the explode method (docs), which gives you the same result, like this:

from pyspark.sql.functions import explode, col

# people.firstName extracts an array of the firstName values; explode
# turns it into one row per element (with the default column name col).
sql_df.select("id", explode(col("people.firstName"))).show()

+---+-----+
| id|  col|
+---+-----+
|  1| John|
|  1| Jane|
|  2|  Bob|
|  2|Betty|
+---+-----+
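
Note that explode leaves the output column with the default name col, as seen above. A minimal variation, exploding the whole struct and aliasing it, keeps a proper column name (person is just an illustrative alias):

from pyspark.sql.functions import explode, col

# Explode the whole struct, then select the named field from it;
# alias() replaces explode's default "col" column name.
sql_df.select("id", explode(col("people")).alias("person")) \
    .select("id", col("person.firstName")) \
    .show()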

Upvotes: 6
