Clayton A. Santos
Clayton A. Santos

Reputation: 397

How to list all parquet files in s3 bucket folder via OPENROWSET (SQL Server)?

I have a bucket (AWS) in a folder with 3 PARQUET files that are the same and have different names:

enter image description here

I'm trying to create an EXTERNAL TABLE with the code below:

CREATE EXTERNAL TABLE tb_Test
(
coluna_1 INT,
coluna_2 VARCHAR(100)
)
WITH 
    (
    LOCATION = '/testeParquet/**',
    DATA_SOURCE = s3_ds, 
    FILE_FORMAT = ParquetFileFormat
    );
GO

when I try to read the External table I get this error message:

Msg 16561, Level 16, State 1, Line 109
External table 'db_S3.dbo.tb_Test' is not accessible because content of directory cannot be listed.

If I inform the name of the file it creates and reads correctly.

But I would like to create with all the files in the folder without having to inform file by file.

Upvotes: 1

Views: 784

Answers (2)

Tomasz Kapcia
Tomasz Kapcia

Reputation: 21

I had similar problem and it seems that in SQL Server 2022 listing of S3 folder is not supported yet, but listing S3 bucket is working. See this example, in which i selected data from all files in XXX bucket:

This SQL is working (after creating instance CREDENTIAL to S3 bucket):

SELECT 
     r.filepath()
    ,r.filename()
    ,count(*) cnt
FROM OPENROWSET(
    BULK 's3://XXX.s3.eu-central-1.amazonaws.com/*.csv'
    ,FORMAT = 'CSV'
)
WITH ( 
     [col1] VARCHAR(255)
    ,[col2] VARCHAR(255)
    ,[col3] VARCHAR(255)
    ,[col4] VARCHAR(255)
    ,[col5] VARCHAR(255)
) AS [r]
GROUP BY 
    r.filepath()
    ,r.filename()

but after changing bucket to folder:

BULK 's3://XXX.s3.eu-central-1.amazonaws.com/somefolder/*.csv'

SQL Server is returning error:

Msg 13807, Level 16, State 1, Line 75
Content of directory on path 's3://XXX.s3.eu-central-1.amazonaws.com/somefolder/*.csv' cannot be listed.

Found this topic https://learn.microsoft.com/en-us/answers/questions/878813/polybase-s3-not-working-on-sql-server-2022-linux where Microsoft employee is saying that "We are working on this; it has to do with wildcard expansion." and later on "Not in the roadmap, but things can always change.". It looks like it's not implemented yet, unfortunately.

Note: single files can be read from any folder in S3 bucket.

Upvotes: 1

You don't need add '**' end of bucket path.

See: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16&tabs=dedicated#location--folder_or_filepath

CREATE EXTERNAL TABLE tb_Test
(
   coluna_1 INT,
   coluna_2 VARCHAR(100)
)
WITH 
(
   LOCATION = '/testeParquet/',
   DATA_SOURCE = s3_ds, 
   FILE_FORMAT = ParquetFileFormat
);
GO

Upvotes: 0

Related Questions