Reputation: 1882
I'm trying to read JSON into a Dataset (Spark 2.3.2). Unfortunately it doesn't work well.
Here is the data; it's a JSON file with an inner array:
{ "Name": "helloworld",
"info": { "privateInfo": [ {"salary":1200}, {"sex":"M"}],
"house": "sky road"
},
"otherinfo":2
}
{ "Name": "helloworld2",
"info": { "privateInfo": [ {"sex":"M"}],
"house": "sky road"
},
"otherinfo":3
}
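For reference, the file can be read and registered as the people1 view along these lines; this is only a sketch with a made-up path, and since Spark's default JSON reader expects line-delimited records, each object above would need to sit on a single line (or the file be wrapped in a JSON array and read with the multiLine option):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nested-json-example") // app name is illustrative
  .master("local[*]")
  .getOrCreate()

// assumes one JSON object per line (JSON Lines); the path is hypothetical
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people1")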
I use SparkSession to select the columns, but there is a problem: the result is not the value itself, but an array.
val sqlDF = spark.sql("SELECT name, info.privateInfo.salary, info.privateInfo.sex FROM people1")
sqlDF.show()
But the columns salary and sex come back as arrays:
+-----------+-------+-----+
| name| salary| sex|
+-----------+-------+-----+
| helloworld|[1200,]|[, M]|
|helloworld2| []| [M]|
+-----------+-------+-----+
How can I get the values themselves, with their proper datatypes? For example:
+-----------+---------+-----+
|       name|   salary|  sex|
+-----------+---------+-----+
| helloworld|     1200|    M|
|helloworld2|none/null|    M|
+-----------+---------+-----+
Upvotes: 1
Views: 176
Reputation: 1704
Short answer
spark.sql("SELECT name , " +
"element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary ," +
"element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex" +
" FROM people1 ")
+-----------+------+---+
| name|salary|sex|
+-----------+------+---+
| helloworld| 1200| M|
|helloworld2| null| M|
+-----------+------+---+
Long answer
The main concern is the nullability of the array elements, as df.printSchema() shows:
root
|-- Name: string (nullable = true)
|-- info: struct (nullable = true)
| |-- house: string (nullable = true)
| |-- privateInfo: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- salary: long (nullable = true)
| | | |-- sex: string (nullable = true)
|-- otherinfo: long (nullable = true)
So we need a way to filter out the null values. Luckily, Spark 2.4 has built-in higher-order functions.
The first attempt was to use array_remove, but unfortunately null can never be equal to null.
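A sketch of what that attempt might look like; it fails for exactly this reason, since array_remove matches elements by equality, and a comparison against null never evaluates to true:
// does NOT drop the nulls: the equality check against null never succeeds
df.selectExpr("array_remove(info.privateInfo.salary, null)")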
It is still possible with a more verbose syntax:
df.selectExpr("filter(info.privateInfo.salary, salary -> salary is not null)")
+------+
|salary|
+------+
|[1200]|
| []|
+------+
Now we need some way to explode the array; luckily, Spark has the explode function!
df.selectExpr(
"explode(filter(info.privateInfo.salary, salary -> salary is not null)) AS salary",
"explode(filter(info.privateInfo.sex, sex -> sex is not null)) AS sex")
Boom
Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2
Since there should be at most one value in each filtered array, we can use element_at (which returns null when the index exceeds the array length):
df.selectExpr(
"element_at(filter(info.privateInfo.salary, salary -> salary is not null), 1) AS salary",
"element_at(filter(info.privateInfo.sex, sex -> sex is not null), 1) AS sex")
P.S. I hadn't noticed this was asked 10 months ago.
Upvotes: 1