kklo

Reputation: 821

Can Apache Drill query a list of files with updated data?

I have a large CSV file (more than 8.5 GB) that is refreshed on the first day of each month. From the 2nd to the last day of the month, new or updated records can arrive in JSON format.

I converted the CSV to Parquet and query it in Apache Drill, which works fine. But how can I query the big file together with the update files?

For example, the April 1 CSV file has

ID          Name           Value    LastUpdatedTime
100         John           98       2024-01-05

The April 15 JSON file has

ID          Name           Value    LastUpdatedTime
100         John           100      2024-04-15

When I query all these files for ID = 100, the result should be Value = 100, because that row has the newer LastUpdatedTime.

I found this post saying that people use Drill on data that is no longer changing.

Is that true?

Upvotes: 1

Views: 32

Answers (1)

Dzamo Norton

Reputation: 1389

It is true that Drill does not support modifying existing data, but I don't think you need that here. Have you tried something like

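with combined as (
  -- monthly snapshot plus the mid-month updates, stacked into one row set
  -- (`Value` is backticked in case Drill treats it as a reserved word)
  select ID, Name, `Value`, LastUpdatedTime from dfs.csv_data
  union all
  select ID, Name, `Value`, LastUpdatedTime from dfs.json_data
), ranked as (
  -- number each ID's rows from newest to oldest
  select ID, Name, `Value`, LastUpdatedTime,
         row_number() over (partition by ID order by LastUpdatedTime desc) as rn
  from combined
)
-- keep only the newest row per ID
select ID, Name, `Value`, LastUpdatedTime from ranked where rn = 1;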

?
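If you want to sanity-check the "newest row per ID" logic before pointing it at 8.5 GB of data, here is a minimal sketch that runs the same union-all plus row_number() pattern on the two sample rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in for Drill (it assumes a bundled SQLite of 3.25 or later, which is when window functions arrived); the in-memory tables csv_data and json_data are hypothetical substitutes for the Parquet and JSON sources, and the dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
import sqlite3

# Hypothetical in-memory tables standing in for the Parquet snapshot
# and the mid-month JSON updates; the rows are the question's sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE csv_data  (ID INTEGER, Name TEXT, Value INTEGER, LastUpdatedTime TEXT);
    CREATE TABLE json_data (ID INTEGER, Name TEXT, Value INTEGER, LastUpdatedTime TEXT);
    INSERT INTO csv_data  VALUES (100, 'John', 98,  '2024-01-05');
    INSERT INTO json_data VALUES (100, 'John', 100, '2024-04-15');
""")

# Stack both sources, number each ID's rows newest-first, keep row 1.
LATEST_PER_ID = """
    WITH combined AS (
        SELECT ID, Name, Value, LastUpdatedTime FROM csv_data
        UNION ALL
        SELECT ID, Name, Value, LastUpdatedTime FROM json_data
    ), ranked AS (
        SELECT ID, Name, Value, LastUpdatedTime,
               ROW_NUMBER() OVER (
                   PARTITION BY ID ORDER BY LastUpdatedTime DESC
               ) AS rn
        FROM combined
    )
    SELECT ID, Name, Value, LastUpdatedTime FROM ranked WHERE rn = 1
"""

rows = conn.execute(LATEST_PER_ID).fetchall()
print(rows)  # [(100, 'John', 100, '2024-04-15')]
```

The April 15 row wins because it has the later LastUpdatedTime. Nothing is modified in place; the query just picks the newest version of each ID at read time.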

Upvotes: 0
