Reputation: 443
I have a nested JSON structure loaded into a DataFrame in Spark. It contains several layers of arrays, and I'm trying to figure out how to query this structure by values in the inner arrays.
Example: consider the following structure (a directors.json file):
[
  {
    "director": "Steven Spielberg",
    "films": [
      {
        "name": "E.T",
        "actors": ["Henry Thomas", "Drew Barrymore"]
      },
      {
        "name": "The Goonies",
        "actors": ["Sean Astin", "Josh Brolin"]
      }
    ]
  },
  {
    "director": "Quentin Tarantino",
    "films": [
      {
        "name": "Pulp Fiction",
        "actors": ["John Travolta", "Samuel L. Jackson"]
      },
      {
        "name": "Kill Bill: Vol. 1",
        "actors": ["Uma Thurman", "Daryl Hannah"]
      }
    ]
  }
]
Let's say I want to run a query that will return all the films that a specific actor has participated in. Something like this:
val directors = spark.read.json("directors.json")
directors.select($"films.name").where($"films.actors" === "Henry Thomas")
When I run this in the Spark shell I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve '(`films`.`actors` = 'Henry Thomas')' due to data type mismatch: differing types in '(`films`.`actors` = 'Henry Thomas')' (array<array<string>> and string).;;
'Project [name#128]
+- 'Filter (films#92.actors = Henry Thomas)
+- AnalysisBarrier
+- Project [films#92.name AS name#128, films#92]
+- Relation[director#91,films#92] json
How do I properly make such a query?
Are there different alternatives? If so, what are the pros and cons?
Thanks
Edit
@thebluephantom this still doesn't work; I'm getting a similar exception. I think it's because I have an array within another array. This is the exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(`films`.`actors`, 'Henry Thomas')' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members;;
'Filter array_contains(films#7.actors, Henry Thomas)
+- AnalysisBarrier
+- Project [director#6, films#7]
+- Relation[director#6,films#7] json
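For reference, on Spark 2.4+, where the flatten function is available, the nested array<array<string>> can be collapsed before calling array_contains. A minimal sketch, assuming Spark 2.4 or later (note that this filters whole director rows, since flatten merges the actor lists of all of a director's films):
import org.apache.spark.sql.functions.{array_contains, flatten}

// flatten collapses array<array<string>> into array<string>, so
// array_contains can compare against a plain string.
directors
  .where(array_contains(flatten($"films.actors"), "Henry Thomas"))
  .select($"films.name")
  .show(false)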
Upvotes: 0
Views: 261
Reputation: 21
Problem statement: Let's say I want to run a query that will return all the films that a specific actor has participated in.
Solution:
import spark.implicits._
import org.apache.spark.sql.functions.explode

val nested_json = spark.read.option("multiline", "true").json("\\directors.json") // path to the JSON file

// One row per (director, film name)
val films_directors_df = nested_json.select("director", "films.name")
val films_directors_flatten = films_directors_df.withColumn("name", explode($"name")).as("films")
films_directors_flatten.show(false)

// One row per (film name, actor)
val mys = nested_json.select("films")
val films__flatten = mys.withColumn("films", explode($"films"))
val actors_name_df = films__flatten.select("films.actors", "films.name")
//actors_name_df.show(false) // contains the actors across films
val actors_name_flatten = actors_name_df.withColumn("actors", explode($"actors")).as("actors")
actors_name_flatten.show(false)

// Join the two flattened views on the film name
val final_df = films_directors_flatten
  .join(actors_name_flatten, $"films.name" === $"actors.name")
  .select($"films.director" as "directors", $"films.name" as "films", $"actors.actors" as "actors")
final_df.show(false)
final_df is the DataFrame you can query for all the films that a specific actor has participated in, which answers the question above.
Sample query: final_df.filter($"actors" === "Henry Thomas").show(false)
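With the sample data above, this should return a single row: Steven Spielberg | E.T | Henry Thomas.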
Cheers!!
Upvotes: 0
Reputation: 18053
Try something similar to this, whereby the film data is exploded, which means the repeating group of actors is simply normalized - otherwise I cannot get it to work either; maybe someone else can.
A more complete example, using Spark 2.3.1 with your data:
import spark.implicits._
import org.apache.spark.sql.functions.{array_contains, explode}

val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/films.txt")

// One row per (director, film), with the film struct in films_flat
val flattened = df.select($"director", explode($"films").as("films_flat"))
flattened.select("*").where(array_contains(flattened("films_flat.actors"), "Henry Thomas")).show(false)
returns:
+----------------+-------------------------------------+
|director |films_flat |
+----------------+-------------------------------------+
|Steven Spielberg|[[Henry Thomas, Drew Barrymore], E.T]|
+----------------+-------------------------------------+
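To return just the film titles rather than the whole films_flat struct, the same filter can be followed by a narrower select; a minimal sketch against the flattened DataFrame above:
// Same explode-based filter, projecting only the film name.
flattened
  .where(array_contains($"films_flat.actors", "Henry Thomas"))
  .select($"films_flat.name".as("film"))
  .show(false)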
Upvotes: 1