Saugat Mukherjee
Saugat Mukherjee

Reputation: 1000

Databricks SQL Query Nested Json column which is stored as string

So, I have data in a column which looks like this:

select additional_data,typeof(additional_data) as type 
from table

sample data.

I am interested in querying the field appName. However, I can't seem to find a way to do this.

I have tried:

select additional_data.content.appName
from table

but that gives me an error saying

can't extract value from additional_data#10527.content: need struct type but got string

since content is string, I then tried casting it to struct , using something like

select additional_data, 
       cast(additional_data.content as struct)
from table

but that gave me

DataType struct is not supported

I then tried to check if the content could be changed to struct, but I couldn't with

select additional_data,
named_struct('content', additional_data.content),
typeof(named_struct('content', additional_data.content))
from table

The type remained the same as in the original case with the content still being a string

stillstring

I also looked at this link, but even that I could not get to work

If I did this

select additional_data,
additional_data:content
from table

it errored with

Cannot resolve 'semi_structured_extract_json_multi(table.additional_data, '$.content')' due to data type mismatch: argument 1 requires string type, however, 'table.additional_data' is of struct<content:string> type.

Can someone help with how I can do this correctly and of course efficiently?

Let's assume that the original table can't be altered and the end users can only use pure sql (no spark dataframes)

I can of course answer questions.

Upvotes: 1

Views: 5609

Answers (2)

Saugat Mukherjee
Saugat Mukherjee

Reputation: 1000

So, two options:

  1. More prone to breaking changes , if new properties are added, as we have to define an exact schema. Something like
select from_json(additional_data.content,'searchPhrase string, isResultFound string, appName string').appName
from table
  1. Use an approach which is immune to changes
select get_json_object(additional_data.content,  '$.appName') 
from table

Upvotes: 1

vilalabinot
vilalabinot

Reputation: 1611

The idea is to convert your first line to a structured value, extract the content from content, then again parse your string to another structured value (through from_json), then extract the values from the key-value pair.

This should do the trick:

val df = spark.sql(
  """select content,
    |from_json(content, 'struct<searchPhrase:string>').searchPhrase as searchPhrase,
    |from_json(content, 'struct<isResultFound:string>').isResultFound as isResultFound,
    |from_json(content, 'struct<appName:string>').appName as appName
    |from (select from_json(additional_data, 'struct<content:string>').content as content from table)
    |"""
    .stripMargin)

or in Dataset API:

ds = ds
  .withColumn("content", expr("from_json(additional_data, 'STRUCT<content:STRING>').content"))
  .withColumn("other_json", expr("from_json(content, 'struct<searchPhrase:string,isResultFound:string,appName:string>')"))
  .withColumn("searchPhrase", expr("other_json.searchPhrase"))
  .withColumn("isResultFound", expr("other_json.isResultFound"))
  .withColumn("appName", expr("other_json.appName"))
  .drop("other_json", "additional_data")

Final result:

+---------------------------------------------------------------------+------------+-------------+-------+
|content                                                              |searchPhrase|isResultFound|appName|
+---------------------------------------------------------------------+------------+-------------+-------+
|{"searchPhrase": "shor", "isResultFound": "true", "appName": "phone"}|shor        |true         |phone  |
+---------------------------------------------------------------------+------------+-------------+-------+

Good luck!

Upvotes: 1

Related Questions