Reputation: 11
I have JSON files landing in our blob storage with two fields:
offset (integer)
value (base64)
The value column is JSON with unicode (and that's why it's base64-encoded).
{
"offset": 1,
"value": "eyJfaWQiOiAiNjQxY2I3MWQyY...a very long base64-encoded text"
}
The challenge is that this base64-encoded value JSON is very large, with 100+ fields, so we cannot define the schema; we can only provide some schema hints. Auto Loader seems the best fit.
I tried to use Auto Loader with schema hints and other options. It always infers value as a string and cannot parse it back to JSON without me providing the schema.
I want Databricks to infer the schema automatically. The schema is dynamic: fields can be added or removed at any time, so a fixed schema won't work at all.
But Auto Loader always infers value as a string (the inner JSON is escaped as a string inside the value column), which makes automatic inference impossible.
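Here is a minimal sketch of the kind of read I'm attempting (the paths, checkpoint location, and hint are placeholders, not our real setup):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/raw_schema")  # placeholder
      .option("cloudFiles.schemaHints", "offset BIGINT")
      .load("/mnt/raw/events"))                                            # placeholder

# df.printSchema() gives:
#   offset: bigint
#   value: string   <- the inner JSON stays an opaque string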
Upvotes: 0
Views: 427
Reputation: 74749
Have you tried to base64-decode the value column and use the from_json standard function? That'd be my first approach to handle this case.
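For reference, a minimal sketch of that idea in PySpark (paths are placeholders, and it samples a single row to infer the inner schema, which only works if that row carries all the fields):

from pyspark.sql.functions import col, unbase64, from_json, schema_of_json

# Batch read of the outer JSON files (placeholder path).
raw = spark.read.json("/mnt/raw/events")

# Decode the base64 payload back to a JSON string.
decoded = raw.withColumn("value_json", unbase64(col("value")).cast("string"))

# Infer the inner schema from one sampled row (assumes that row has every field),
# then parse all rows with it.
sample = decoded.select("value_json").first()["value_json"]
ddl_schema = decoded.select(schema_of_json(sample).alias("s")).first()["s"]

parsed = decoded.withColumn("value_struct", from_json(col("value_json"), ddl_schema))
parsed.select("offset", "value_struct.*").show(truncate=False)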
If that does not work, I'd do the data processing in two steps (you could use Databricks Jobs or even Delta Live Tables pipelines).
The first step would load the incoming files using Auto Loader, decode the value column, and write the decoded data out to blob storage for the next step to read back (again with Auto Loader); see the sketch below.
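A rough sketch of that two-step shape in PySpark (all paths and checkpoint locations are placeholders; the point is that the second stream sees plain, decoded JSON, so Auto Loader can infer and evolve the full schema):

from pyspark.sql.functions import col, unbase64

# Step 1: Auto Loader reads the raw files, decodes value, and writes the decoded
# JSON documents out as text files (one JSON document per line).
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/checkpoints/raw_schema")
       .load("/mnt/raw/events"))

# The offset column is dropped here for simplicity; carry it along if it's needed downstream.
decoded = raw.select(unbase64(col("value")).cast("string").alias("value"))

(decoded.writeStream
 .format("text")   # the text sink writes the single string column verbatim
 .option("checkpointLocation", "/mnt/checkpoints/decode")
 .start("/mnt/staging/decoded"))

# Step 2: a second Auto Loader stream reads the decoded files as JSON and can now
# infer (and evolve) the full 100+ field schema.
clean = (spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/checkpoints/clean_schema")
         .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
         .load("/mnt/staging/decoded"))

(clean.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/clean")
 .start("/mnt/tables/events"))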
I really think there's no way Auto Loader can handle base64-encoded data as JSON so it needs some help (and hence this Hadoop MapReduce-like data processing pipeline).
I think you could even "convert" the incoming JSON files into Parquet for faster loading, and perhaps Delta tables could work fine too. There are two JSON layers in the data (files); I'm referring to the JSON format of the entire data file when thinking of the Parquet conversion.
Upvotes: 0