botchniaque
botchniaque

Reputation: 5134

Convert struct to map in Spark SQL

I am trying to convert one dataset which declares a column to have a certain struct type (eg. struct<x: string, y: string>) to a map<string, string> type. And I would like to do it in SQL, possibly without using UDFs.

UPDATE:

My requirement is also that the transformation is done generically without any prior knowledge of the struct keys (in my problem I am getting data in a complex JSON, and I don't want to keep that complexity in the schema).

Example input data:

WITH input (struct_col) as (
    select named_struct('x', 'valX', 'y', 'valY') union all
    select named_struct('x', 'valX1', 'y', 'valY2')
)
select *
from input

expected output is a column of type map<string, string>

struct_col:map<string, string>
{"x":"valX","y":"valY"}
{"x":"valX1","y":"valY2"}

UPDATE:

So far I managed to find this very convoluted solution which works only with Spark >= 3.1.0 (because of json_object_keys function). Would be so nice to just be able to cast a struct to a map

WITH input (struct_col) as (
    select named_struct('x', 'valX', 'y', 'valY') union all
    select named_struct('x', 'valX1', 'y', 'valY2')
)
select transform_values(
       map_from_arrays(
               json_object_keys(to_json(struct_col)),
               json_object_keys(to_json(struct_col))
           ),
        (k, v) -> get_json_object(to_json(struct_col), '$.' || k))
from input

Upvotes: 6

Views: 10233

Answers (2)

Topde
Topde

Reputation: 581

What about

create_map('struct_col.x', 'struct_col.valX', 'struct_col.y', 'struct_col.valY')

Upvotes: 0

botchniaque
botchniaque

Reputation: 5134

I have found a way to do it which requires one roundtrip of serializing and parsing a json using the to_json and from_json functions. The trick is that the from_json also takes a schema argument where I use the map<string, string> type.

Also, this solution should work with spark < 3.x

WITH input (struct_col) as (
    select named_struct('x', 'valX', 'y', 'valY')
    union all
    select named_struct('x', 'valX1', 'y', 'valY2')
) 
select from_json(to_json(struct_col), 'map<string, string>') as map_col
from input;

Upvotes: 10

Related Questions