shankar gopinath
shankar gopinath

Reputation: 63

Loading data from ADLS Gen 2 into Azure Synapse

I am trying to load Parquet files from ADLS Gen2 to Synapse using polybase external table feature.

Below is the code, but when running the create external table command, the query never completes. On cancelling the query execution, i see this error -

External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message: HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: UnknownHostException: ''.azuredatalakestore.dfs.core.windows.net'

SQL query

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO

DROP CREDENTIAL ADLSCredential
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = 'user',
    SECRET = '<secret-key>'
;

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);

-- Create an external file format for PARQUET files.  
CREATE EXTERNAL FILE FORMAT parquet  
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
); 

CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '',
        DATE_FORMAT = '',
        USE_TYPE_DEFAULT = False
    )
);

CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
    [AMOUNT_APPLIED] [float] NOT NULL,
    [TRX_NUMBER] [nvarchar](50) NULL,
    [SHORT_NAME] [nvarchar](50) NOT NULL,
    [NAME] [nvarchar](1) NULL,
    [CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
    [CURRENCY_CODE] [nvarchar](50) NULL,
    [FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
    [CASH_RCPT_AMOUNT] [float] NULL,
    [CASH_HISTORY_AMOUNT] [float] NULL,
    [FUNC_AMT_HISTORY] [float] NULL,
    [STATUS] [nvarchar](50) NULL,
    [ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
    [CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
    [GL_DATE] [datetime2](7) NULL,
    [GL_PERIOD] [datetime2](7) NOT NULL,
    [BATCH_GL_DATE] [nvarchar](1) NULL,
    [EXCHANGE_RATE] [nvarchar](50) NULL,
    [RECEIPT_NUMBER] [nvarchar](50) NULL,
    [DEPOSIT_DATE] [datetime2](7) NULL,
    [RECEIPT_DATE] [datetime2](7) NULL,
    [ISSUE_DATE] [nvarchar](1) NULL,
    [TYPE] [nvarchar](50) NULL,
    [GL_POSTED_DATE] [datetime2](7) NULL,
    [AMOUNT] [float] NULL
)
WITH
(
    LOCATION='parquetfiles'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

Upvotes: 2

Views: 3780

Answers (2)

shankar gopinath
shankar gopinath

Reputation: 63

The Create External Data source command has an attribute location

LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net'

It should have been

LOCATION = 'abfss://<container>@<storage-account>.dfs.core.windows.net'

I got it mixed up with ADLS Gen 1 Location attribute. My bad.Thanks to all for taking time to look into this.Marking this as closed. I ended up using the AAD app registration token instead of the storage key.

Upvotes: 2

Leon Yue
Leon Yue

Reputation: 16431

According the error message, the error is caused by the location "parquetfiles".

Please try the bellow CREATE EXTERNAL DATA SOURCE command:

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH
  ( LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net' ,
    CREDENTIAL = AzureStorageCredential ,
    TYPE = BLOB_STORAGE
  ) ;

When CREATE EXTERNAL TABLE, use the file or folder name, :

CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
    [AMOUNT_APPLIED] [float] NOT NULL,
    [TRX_NUMBER] [nvarchar](50) NULL,
    [SHORT_NAME] [nvarchar](50) NOT NULL,
    [NAME] [nvarchar](1) NULL,
    [CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
    [CURRENCY_CODE] [nvarchar](50) NULL,
    [FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
    [CASH_RCPT_AMOUNT] [float] NULL,
    [CASH_HISTORY_AMOUNT] [float] NULL,
    [FUNC_AMT_HISTORY] [float] NULL,
    [STATUS] [nvarchar](50) NULL,
    [ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
    [CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
    [GL_DATE] [datetime2](7) NULL,
    [GL_PERIOD] [datetime2](7) NOT NULL,
    [BATCH_GL_DATE] [nvarchar](1) NULL,
    [EXCHANGE_RATE] [nvarchar](50) NULL,
    [RECEIPT_NUMBER] [nvarchar](50) NULL,
    [DEPOSIT_DATE] [datetime2](7) NULL,
    [RECEIPT_DATE] [datetime2](7) NULL,
    [ISSUE_DATE] [nvarchar](1) NULL,
    [TYPE] [nvarchar](50) NULL,
    [GL_POSTED_DATE] [datetime2](7) NULL,
    [AMOUNT] [float] NULL
)
WITH
(
    LOCATION='[filename]'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

Ref:

  1. Create external data source to reference Azure blob storage
  2. Create external table: Arguments:

LOCATION = 'folder_or_filepath' Specifies the folder or the file path and file name for the actual data in Hadoop or Azure blob storage.

If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its subfolders. Just like Hadoop, PolyBase doesn't return hidden folders. It also doesn't return files for which the file name begins with an underline (_) or a period (.).

In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.

Please note that: One file for one table, we could load multiple files to create the external table!

Hope this helps.

Upvotes: 0

Related Questions