Nag
Nag

Reputation: 2057

How Openrowset works internally when the data is loaded

I am going through the azure documentation and come across the following phrase

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

  1. where does the data is loaded and processed - is it in memory . Does it load the data in chunks similar to spark does ?
  2. And also it seems Openrowset is supported with serverless sql pool and not supported with dedicated sql pool - what could have been the rationale in doing so , though both the pools backed up by MS sql server which actually natively supports OPENROWSET.

Upvotes: 0

Views: 1898

Answers (1)

SiddheshDesai
SiddheshDesai

Reputation: 8167

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

  1. where does the data is loaded and processed - is it in memory . Does it load the data in chunks similar to spark does ?

As, OPENROWSET function is only supported in Serverless Synapse SQL. For now, It uses Serverless architecture, There’s one Compute Node, that scales distributed computes according to the needs. Your data is queried in multiple distributed small tasks backed by a compute node unlike dedicated compute node for each task in Dedicated synapse SQL. Distributed Query Processing Engine in Serverless SQL will convert all your SQL queries in a small task and assign those tasks to a Compute node, which will query data from storage account. Serverless Spark pool and Serverless SQL both work on the same architecture of scaling compute when needed to run the queries and scale them down once they are not needed.

enter image description here

Image reference - Synapse SQL architecture - Azure Synapse Analytics | Microsoft Learn

To read and access files from Azure Storage 2 types of methods are used.
OPENROWSET and External Table.

OPENROWSET is used to get the data in the azure storage in the form of row-set, It can be used to connect to remote data source with various azure ad authentication, or It can be used to get bulk data to fetch multiple datasets in the form of row-set from azure storage directly. It is similar to the FROM clause of SQL.

External Table is used to read data located in Hadoop, Azure Storage, Azure Storage Blob, Data lake storage.

And also it seems Openrowset is supported with serverless sql pool and
not supported with dedicated sql pool - what could have been the
rationale in doing so , though both the pools backed up by MS sql
server which actually natively supports OPENROWSET.

To connect to an in-frequent reference to a data source OPENROWSET or OPENDATASOURCE methods are used natively with information specified to connect to infrequently accessed Linked Server. The Rowset is then referenced as a transact SQL statement in an SQL Table.
For now, Azure dedicated Synapse SQL does not support OPENROWSET function.
Refer here :-
https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16

OPENROWSET() for Synapse dedicated pools? BY [Stefan Azarić]

Query :-

    OPENROWSET
   ({ BULK 'unstructured_data_path' . [DATA_SOURCE = <data source name>, ]
      FORMAT ['PARQUET' | 'DELTA'] }
   )
   [WITH ( {'column_name' 'column_type' }) ]
   [AS] table_alias(column_alias, ...n)

Openrowset uses a FROM clause with bulk with data source set to Azure storage account with format supported for csv, parquet, delta, json.

enter image description here

SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>,
   FORMAT = '<format-of-file>
   PARSER_VERSION = '2.0'
   HEADER_ROW = True
) as rowsFromFile

enter image description here

WITH CLAUSE -

SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>,
   FORMAT = '<format-of-file>
   PARSER_VERSION = '2.0'
   HEADER_ROW = True
)
WITH
(
   columnname 
) as output-table

enter image description here

As, this is based on Serverless architecture > each query is distributed in small tasks and ran by a compute node.

Upvotes: 1

Related Questions