tyogi

Reputation: 650

Accessing a JSON array nested in a structure using Spark

I want to access various fields and subfields of a fairly deeply nested structure containing arrays, in order to do arithmetic on them. Some of the data actually lives in the field names themselves (the structure I have to access is produced that way and there is nothing I can do about it). In particular, the field names are numbers, and they change from one JSON file to the next, so I must dynamically discover those field names and then use them together with their subfield values.

I've looked at this: "Access names of fields in struct Spark SQL". Unfortunately, since I do not know in advance what the field names of my structure will be, I cannot use that approach.

I've also tried this, which looked promising: "How to extract the column name and data type from nested struct type in Spark". Unfortunately, whatever magic the "flatten" function performs, I have not been able to adapt it to field names rather than the fields themselves.
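For reference, Spark does expose the names at each level of an inferred schema (df here stands for any loaded DataFrame; the "products" field comes from my sample below):

import org.apache.spark.sql.types.StructType

// The names at the top level of the schema, whatever they happen to be:
val topNames: Array[String] = df.schema.fieldNames
// Going one level down means casting the field's dataType back to a StructType:
val productNames: Array[String] =
  df.schema("products").dataType.asInstanceOf[StructType].fieldNames

A fuller walk down the nested levels is sketched after the schema printout below.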

Here is an example JSON dataset. It represents consumption baskets:

{"type":"test","name":"john doe","products":{
    "baskets":{
        "comp A":{
            "55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]
            ,"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
        }
        ,"comp B":{
            "55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}]
            ,"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]
        }
    }
}}
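For testing, this is how such a sample can be loaded (a minimal sketch; "baskets.json" is a placeholder path, and multiLine is needed because the file is pretty-printed across several lines):

val myDs = spark.read.option("multiLine", true).json("baskets.json")
myDs.printSchema()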

I would like to get all of these numbers into a dataframe so that I can run some operations on them:

+--------+--------+----------+
| basket | price  | quantity |
+--------+--------+----------+
| comp A | 55.80  | 500      |
| comp A | 55.80  | 600      |
| comp A | 132.88 | 700      |
| comp A | 0.03   | 500      |
| comp A | 0.03   | 600      |
| comp B | 55.70  | 500      |
| comp B | 55.70  | 600      |
| comp B | 132.98 | 300      |
| comp B | 132.98 | 900      |
| comp B | 0.01   | 400      |
+--------+--------+----------+

The original dataset is accessed like this:

scala> myDs
res135: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [products: struct<baskets: struct<compA: struct<55.80: array<struct .....
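The direction I have been trying to take is to walk the schema for the dynamic names and then generate one select per (basket, price) pair, unioning the pieces. A sketch of that idea (the backticks escape the spaces and dots inside the discovered names):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, lit}
import org.apache.spark.sql.types.StructType

// Walk the inferred schema down to products.baskets to discover the
// dynamic names at each level.
val basketsType = myDs.schema("products").dataType
  .asInstanceOf[StructType]("baskets").dataType
  .asInstanceOf[StructType]

val pricesPerBasket: Map[String, Seq[String]] =
  basketsType.fieldNames.map { basket =>
    basket -> basketsType(basket).dataType.asInstanceOf[StructType].fieldNames.toSeq
  }.toMap  // e.g. "comp A" -> Seq("55.80", "132.88", "0.03")

// For every (basket, price) pair, explode the array of item structs,
// pull out comment.qty, and tag each row with its basket and price.
val pieces: Seq[DataFrame] = for {
  (basket, prices) <- pricesPerBasket.toSeq
  price            <- prices
} yield myDs
  .select(explode(col(s"products.baskets.`$basket`.`$price`")).as("item"))
  .select(
    lit(basket).as("basket"),
    lit(price).as("price"),
    col("item.comment.qty").as("quantity"))

val result = pieces.reduce(_ union _)
result.show()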

Upvotes: 3

Views: 712

Answers (1)

Ged

Reputation: 18013

Processing data that arrives encoded in column names is not an approach to follow. It simply will not work.
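A sketch of one alternative consistent with this advice: supply an explicit schema that models the dynamic levels as maps, so the former field names arrive as ordinary data rather than as columns (the schema and the "baskets.json" path below are assumptions based on the sample in the question):

import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types._
import spark.implicits._

// Model the dynamic levels (basket name, price) as map keys instead of
// letting Spark infer them as struct field names.
val itemType = new StructType()
  .add("type", StringType)
  .add("comment", new StructType()
    .add("qty", LongType)
    .add("text", StringType))

val schema = new StructType()
  .add("type", StringType)
  .add("name", StringType)
  .add("products", new StructType()
    .add("baskets",
      MapType(StringType,                              // basket name -> prices
        MapType(StringType, ArrayType(itemType)))))    // price -> items

val df = spark.read.schema(schema)
  .option("multiLine", true)
  .json("baskets.json")                                // placeholder path

// Exploding a map yields (key, value) columns, so the former field
// names fall out as regular rows.
val result = df
  .select(explode($"products.baskets").as(Seq("basket", "prices")))
  .select($"basket", explode($"prices").as(Seq("price", "items")))
  .select($"basket", $"price", explode($"items").as("item"))
  .select($"basket", $"price", $"item.comment.qty".as("quantity"))

result.show()

With the maps in place, the reshaping never has to touch column names at all.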

Upvotes: 1
