Reputation: 821
I have a large (more than 8.5GB) CSV file that is updated on the first day of each month. From the 2nd to the last day of each month, newly updated data can arrive in JSON format.
I convert the CSV to Parquet and query it in Apache Drill, which works fine. But how can I query the big file together with the update files?
For example, the Apr 1st CSV file has
ID Name Value LastUpdatedTime
100 John 98 2024-01-05
and the Apr 15 JSON file has
ID Name Value LastUpdatedTime
100 John 100 2024-04-15
When I query all these files for ID = 100, the result should be Value=100, because that row has the newer LastUpdatedTime.
I found this post saying people use Drill on data that is no longer changing.
Is that true?
Upvotes: 1
Views: 32
Reputation: 1389
It is true that Drill does not support modifying existing data, but I don't think you need that here. You can combine both sources at query time and keep only the newest row per ID. Have you tried something like
with combined as (
select ID, Name, Value, LastUpdatedTime from dfs.csv_data
union all
select ID, Name, Value, LastUpdatedTime from dfs.json_data
), ranked as (
select *, row_number() over (partition by ID order by LastUpdatedTime desc) as rn
from combined
)
select * from ranked where rn = 1;
?
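To sanity-check the idea without a Drill cluster, here is a minimal sketch that runs the same "union all, then keep row_number() = 1 per ID" pattern against an in-memory SQLite database from Python. This is not Drill: the table names `csv_data` and `json_data` and the helper `latest_rows` are made up for illustration, and it assumes a SQLite build with window functions (3.25 or newer) and ISO-formatted dates stored as text so they sort correctly.

```python
import sqlite3


def latest_rows(csv_rows, json_rows):
    """Return the newest row per ID across both sources (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    # Stand-ins for the monthly CSV/Parquet data and the daily JSON updates.
    for table in ("csv_data", "json_data"):
        conn.execute(
            f"create table {table} "
            "(ID integer, Name text, Value integer, LastUpdatedTime text)"
        )
    conn.executemany("insert into csv_data values (?, ?, ?, ?)", csv_rows)
    conn.executemany("insert into json_data values (?, ?, ?, ?)", json_rows)

    query = """
        with combined as (
            select ID, Name, Value, LastUpdatedTime from csv_data
            union all
            select ID, Name, Value, LastUpdatedTime from json_data
        ), ranked as (
            select ID, Name, Value, LastUpdatedTime,
                   row_number() over (
                       partition by ID order by LastUpdatedTime desc
                   ) as rn
            from combined
        )
        select ID, Name, Value, LastUpdatedTime
        from ranked
        where rn = 1
        order by ID
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Sample rows taken from the question.
    print(latest_rows(
        [(100, "John", 98, "2024-01-05")],
        [(100, "John", 100, "2024-04-15")],
    ))
```

With the two sample rows from the question, this returns only the Apr 15 row (Value 100). IDs that appear only in the monthly file pass through unchanged, since they have a single row to rank.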
Upvotes: 0