Reputation: 406
I have multiple JSON files in the file structure:
And so on.
All the JSON files have the same schema.
However, I'm having issues ingesting this data into Databricks. I've set the recursiveFileLookup option to true and I'm able to ingest it, but I end up with one row per document (i.e. each file on its own row).
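Roughly, the read I'm doing looks like this (the path below is just a placeholder for the real folder structure):
# Placeholder path; recursiveFileLookup makes Spark pick up JSON files in all nested folders
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .json("/mnt/data/Commodity1/")
)
df.show()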
Is there any way to merge the data into one row in Databricks?
Also, if anyone has a solution to ingest this data into a data flow in Azure Data Factory, please share!
Thanks!
Upvotes: 0
Views: 76
Reputation: 3250
I have tried the below approach in Azure Databricks: read the nested files using recursiveFileLookup and merge the data into a single row.
# Read every JSON file under the nested directory structure
df_files = spark.read.option("recursiveFileLookup", "true").json("/FileStore/tables/Commodity1/Interval/")
df_files.show(truncate=False)
+----------+----------+-----+
|commodity |interval |value|
+----------+----------+-----+
|Commodity1|2024-01-01|100 |
|Commodity1|2024-01-01|200 |
|Commodity1|2024-01-01|300 |
|Commodity1|2024-01-02|150 |
|Commodity1|2024-01-02|250 |
+----------+----------+-----+
from pyspark.sql.functions import collect_list
# Aggregate every value into a single list so the result is one row
merged_df = df_files.agg(collect_list("value").alias("merged_values"))
merged_df.show(truncate=False)
+-------------------------+
|merged_values |
+-------------------------+
|[100, 200, 300, 150, 250]|
+-------------------------+
# Group by interval and collect the values for each date
grouped_df = df_files.groupBy("interval").agg(collect_list("value").alias("merged_values"))
grouped_df.show(truncate=False)
Results:
+----------+---------------+
|interval |merged_values |
+----------+---------------+
|2024-01-01|[100, 200, 300]|
|2024-01-02|[150, 250] |
+----------+---------------+
In the above code, all JSON files in the nested directory structure are read, all values are aggregated into a single list within one row, and the values are then grouped by interval so that the values for each date are collected together.
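If the goal is literally one row holding all of the ingested records rather than just the value column, the same collect_list pattern can be combined with struct. This is a minimal sketch assuming the same sample columns (commodity, interval, value):
from pyspark.sql.functions import collect_list, struct

# Collect every record as a struct into one array column,
# producing a single row that contains all of the ingested data
single_row_df = df_files.agg(
    collect_list(struct("commodity", "interval", "value")).alias("records")
)
single_row_df.show(truncate=False)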
Upvotes: 0