Reputation: 101
I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.
I have the following table:
id | array |
---|---|
1 | [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}] |
2 | [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}] |
In a different SQL editor, I was able to achieve this by doing the following:
SELECT
id,
struct.firstName
FROM
table
CROSS JOIN UNNEST(array) as t(struct)
With a resulting table of:
id | firstName |
---|---|
1 | John |
1 | Jane |
2 | Bob |
2 | Betty |
Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error.
[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.
I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.
Upvotes: 6
Views: 12255
Reputation: 1167
The first solution uses the SQL API. The code snippet below just prepares the test data, so you can skip it if you already have the table in place.
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

# Schema matching the question: an id plus an array of structs with 3 string fields
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('people', ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])
sql_df = spark.createDataFrame([
    (1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
    (2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)
sql_df.createOrReplaceTempView("sql_df")
What you need is the LATERAL VIEW clause (docs), which lets you explode the nested structures, like this:
SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) t AS exploded;
+---+---------+
| id|firstName|
+---+---------+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2| Betty|
+---+---------+
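If you prefer to avoid LATERAL VIEW entirely, Spark SQL also accepts explode() directly in the SELECT list as a generator function. A minimal sketch against the same sql_df temp view defined above (the firstName dot-access pulls just that field out of each struct before exploding):

```sql
-- people.firstName yields an array of strings; explode turns it into one row per element
SELECT id, explode(people.firstName) AS firstName
FROM sql_df;
```

This produces the same id/firstName pairs as the LATERAL VIEW query.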
The alternative approach is to use the explode function (docs) from pyspark.sql.functions, which gives you the same rows (note the exploded column is named col by default; chain .alias("firstName") to keep the original name):
from pyspark.sql.functions import explode, col
sql_df.select("id", explode(col("people.firstName"))).show()
+---+-----+
| id| col|
+---+-----+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2|Betty|
+---+-----+
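If you need more than one field from each struct, you can explode the whole struct and then reference its fields by name. A sketch using the same sql_df temp view (the aliases t and person are arbitrary):

```sql
-- Each array element becomes a row holding the full struct, so all 3 fields are available
SELECT id, person.firstName, person.lastName, person.age
FROM sql_df
LATERAL VIEW explode(people) t AS person;
```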
Upvotes: 6