Reputation: 650
I want to access fields and subfields of a fairly deeply nested structure containing arrays, in order to do arithmetic on them. Some of the data is actually stored in the field names themselves (the structure I have to access is built that way, and there is nothing I can do about that). In particular, the field names are a list of numbers that change from one JSON file to the next, so I must dynamically discover those field names and then use them together with the subfield values.
I've looked at this: Access names of fields in struct Spark SQL. Unfortunately, I do not know in advance what the field names of my structure will be, so I cannot use that approach.
I've also tried this, which looked promising: how to extract the column name and data type from nested struct type in spark. Unfortunately, whatever magic the "flatten" function performs, I have not been able to adapt it to field names rather than the fields themselves.
Here is an example JSON dataset. It represents consumption baskets:
{"type":"test","name":"john doe","products":{
"baskets":{
"comp A":{
"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
,"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]
,"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
}
,"comp B":{
"55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
,"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}]
,"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]
}
}
}}
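I can at least list the dynamic names by walking the schema that spark.read.json infers. This is a rough sketch of what I have been attempting (the file name is illustrative; the multiLine option lets Spark read the pretty-printed example above as one document):

import org.apache.spark.sql.types.StructType

val df = spark.read.option("multiLine", "true").json("baskets.json")

// Walk the inferred schema down to products.baskets.
val basketsSchema = df.schema("products").dataType
  .asInstanceOf[StructType]("baskets").dataType
  .asInstanceOf[StructType]

// The dynamic names: the baskets, and under each basket its prices.
val basketNames = basketsSchema.fieldNames.toSeq // "comp A", "comp B"
val priceNames = basketNames.map { b =>
  b -> basketsSchema(b).dataType.asInstanceOf[StructType].fieldNames.toSeq
}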
I would like to obtain all these numbers in a dataframe in order to do some operations on them:
+--------+--------+----------+
| basket | price  | quantity |
+--------+--------+----------+
| comp A |  55.80 |      500 |
| comp A |  55.80 |      600 |
| comp A | 132.88 |      700 |
| comp A |   0.03 |      500 |
| comp A |   0.03 |      600 |
| comp B |  55.70 |      500 |
| comp B |  55.70 |      600 |
| comp B | 132.98 |      300 |
| comp B | 132.98 |      900 |
| comp B |   0.01 |      400 |
+--------+--------+----------+
The original dataset is accessed as follows:
scala> myDs
res135: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [products: struct<baskets: struct<compA: struct<55.80: array<struct .....
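To make the goal concrete, this is the kind of dynamic select I have been trying to build from the names discovered in the sketch above; it is only a sketch, and I am not sure it is the right direction:

import org.apache.spark.sql.functions.{col, explode, lit}

// One small dataframe per (basket, price) pair, then union them all.
// Backticks are required because the price names contain dots.
val result = priceNames
  .flatMap { case (basket, prices) =>
    prices.map { price =>
      df.select(
          lit(basket).as("basket"),
          lit(price).cast("double").as("price"),
          explode(col(s"products.baskets.`$basket`.`$price`")).as("entry"))
        .select(col("basket"), col("price"), col("entry.comment.qty").as("quantity"))
    }
  }
  .reduce(_ union _)

result.show()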
Upvotes: 3
Views: 712
Reputation: 18013
Processing data that arrives encoded in column names is not an approach to follow: column names belong to the schema, not to the data, so processing built on them will simply not work once the names change from file to file.
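A minimal sketch of the alternative this implies, assuming the schema can be declared up front (the schema and code below are illustrative, not from the original post): model the dynamic levels as map keys, so that basket names and prices arrive as data rather than as column names.

import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

// Entries under each price: {"type": ..., "comment": {"qty": ..., "text": ...}}.
val entry = new StructType()
  .add("type", StringType)
  .add("comment", new StructType().add("qty", LongType).add("text", StringType))

// The dynamic levels become maps: basket name -> (price -> entries).
val schema = new StructType()
  .add("type", StringType)
  .add("name", StringType)
  .add("products", new StructType()
    .add("baskets", MapType(StringType, MapType(StringType, ArrayType(entry)))))

val df = spark.read.option("multiLine", "true").schema(schema).json("baskets.json")

// Explode the basket map, then the price map, then the qty of each entry.
val result = df
  .select(explode(col("products.baskets")).as(Seq("basket", "prices")))
  .select(col("basket"), explode(col("prices")).as(Seq("price", "entries")))
  .select(col("basket"), col("price").cast("double").as("price"),
    explode(col("entries.comment.qty")).as("quantity"))

result.show()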
Upvotes: 1