Reputation: 391
I have a large dataset with ~40.000.000.000 records, which is partitioned by date.
To ensure that a single user does not query the entire dataset at once, I would like to provide a SQL function instead of direct access to the dataset. The SQL function has an @from and an @to data as input parameters and checks that the function call will not query more than the data of 14 days at once.
So far so good, but what doesn't work is to use a date range (list of dates) as filter criteria in my OPENROWSET query.
I have already tried to provide the date list as subquery, #tempTable and @tableVar, but nothing seems to work.
Below you can see my last attempt with a subquery.
DECLARE @From DATE = '2020-10-17', @To DATE = '2020-10-25';
SELECT *
FROM OPENROWSET(BULK '/date=*/*', DATA_SOURCE = 'data', FORMAT='PARQUET') as rows
WHERE rows.filepath(1) In (
(
SELECT TOP (DATEDIFF(DAY, @From, @To) + 1) Date = DATEADD(DAY, ROW_NUMBER() OVER(ORDER BY a.object_id) - 1, @From)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
) -- Date list based on the @From and @To
When I run this code I get the following error message, just like when I use a #tempTable
The query references an object that is not supported in distributed processing mode.
I am grateful for every tip.
Upvotes: 0
Views: 1498
Reputation: 391
Oh man, thanks for the tip kashyap.
Somehow I did not try the most obvious solution before:
DECLARE @From DATE = '2020-10-17', @To DATE = '2020-10-25';
SELECT *
FROM OPENROWSET(BULK '/date=*/*', DATA_SOURCE = 'data', FORMAT='PARQUET') as rows
WHERE rows.filepath(1) >= @From and rows.filepath(1) <= @To
This solution provides the correct result and also utilizes the partitioning to ensure that as few data as possible has to be retrieved from the data lake.
Upvotes: 2