Reputation: 63
I am trying to load Parquet files from ADLS Gen2 to Synapse using polybase external table feature.
Below is the code, but when running the create external table command, the query never completes. On cancelling the query execution, i see this error -
External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message: HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: UnknownHostException: ''.azuredatalakestore.dfs.core.windows.net'
SQL query
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
DROP CREDENTIAL ADLSCredential
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
IDENTITY = 'user',
SECRET = '<secret-key>'
;
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
TYPE = HADOOP,
LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net',
CREDENTIAL = ADLSCredential
);
-- Create an external file format for PARQUET files.
CREATE EXTERNAL FILE FORMAT parquet
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '',
DATE_FORMAT = '',
USE_TYPE_DEFAULT = False
)
);
CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
[AMOUNT_APPLIED] [float] NOT NULL,
[TRX_NUMBER] [nvarchar](50) NULL,
[SHORT_NAME] [nvarchar](50) NOT NULL,
[NAME] [nvarchar](1) NULL,
[CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
[CURRENCY_CODE] [nvarchar](50) NULL,
[FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
[CASH_RCPT_AMOUNT] [float] NULL,
[CASH_HISTORY_AMOUNT] [float] NULL,
[FUNC_AMT_HISTORY] [float] NULL,
[STATUS] [nvarchar](50) NULL,
[ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
[CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
[GL_DATE] [datetime2](7) NULL,
[GL_PERIOD] [datetime2](7) NOT NULL,
[BATCH_GL_DATE] [nvarchar](1) NULL,
[EXCHANGE_RATE] [nvarchar](50) NULL,
[RECEIPT_NUMBER] [nvarchar](50) NULL,
[DEPOSIT_DATE] [datetime2](7) NULL,
[RECEIPT_DATE] [datetime2](7) NULL,
[ISSUE_DATE] [nvarchar](1) NULL,
[TYPE] [nvarchar](50) NULL,
[GL_POSTED_DATE] [datetime2](7) NULL,
[AMOUNT] [float] NULL
)
WITH
(
LOCATION='parquetfiles'
, DATA_SOURCE = AzureDataLakeStorage
, FILE_FORMAT = parquet
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
;
Upvotes: 2
Views: 3780
Reputation: 63
The Create External Data source command has an attribute location
LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net'
It should have been
LOCATION = 'abfss://<container>@<storage-account>.dfs.core.windows.net'
I got it mixed up with ADLS Gen 1 Location attribute. My bad.Thanks to all for taking time to look into this.Marking this as closed. I ended up using the AAD app registration token instead of the storage key.
Upvotes: 2
Reputation: 16431
According the error message, the error is caused by the location "parquetfiles".
Please try the bellow CREATE EXTERNAL DATA SOURCE
command:
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH
( LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net' ,
CREDENTIAL = AzureStorageCredential ,
TYPE = BLOB_STORAGE
) ;
When CREATE EXTERNAL TABLE
, use the file or folder name, :
CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
[AMOUNT_APPLIED] [float] NOT NULL,
[TRX_NUMBER] [nvarchar](50) NULL,
[SHORT_NAME] [nvarchar](50) NOT NULL,
[NAME] [nvarchar](1) NULL,
[CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
[CURRENCY_CODE] [nvarchar](50) NULL,
[FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
[CASH_RCPT_AMOUNT] [float] NULL,
[CASH_HISTORY_AMOUNT] [float] NULL,
[FUNC_AMT_HISTORY] [float] NULL,
[STATUS] [nvarchar](50) NULL,
[ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
[CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
[GL_DATE] [datetime2](7) NULL,
[GL_PERIOD] [datetime2](7) NOT NULL,
[BATCH_GL_DATE] [nvarchar](1) NULL,
[EXCHANGE_RATE] [nvarchar](50) NULL,
[RECEIPT_NUMBER] [nvarchar](50) NULL,
[DEPOSIT_DATE] [datetime2](7) NULL,
[RECEIPT_DATE] [datetime2](7) NULL,
[ISSUE_DATE] [nvarchar](1) NULL,
[TYPE] [nvarchar](50) NULL,
[GL_POSTED_DATE] [datetime2](7) NULL,
[AMOUNT] [float] NULL
)
WITH
(
LOCATION='[filename]'
, DATA_SOURCE = AzureDataLakeStorage
, FILE_FORMAT = parquet
, REJECT_TYPE = VALUE
, REJECT_VALUE = 0
)
;
Ref:
LOCATION = 'folder_or_filepath' Specifies the folder or the file path and file name for the actual data in Hadoop or Azure blob storage.
If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its subfolders. Just like Hadoop, PolyBase doesn't return hidden folders. It also doesn't return files for which the file name begins with an underline (_) or a period (.).
In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.
Please note that: One file for one table, we could load multiple files to create the external table!
Hope this helps.
Upvotes: 0