Indra
Indra

Reputation: 61

What is the difference between polybase and bulk insert, copy methods in azure data factory and when to use them?

Can some one please elaborate on when to use Polybase versus bulk insert in azure datafactory, what are the differences between these two copy methods?

Upvotes: 6

Views: 15130

Answers (2)

Dinesh vishe
Dinesh vishe

Reputation: 3598

SQL Server PolyBase and its way to fetch data from the text file stored in the Azure Data Storage. The PolyBase works as an intermediate for communication between the Azure Data Storage and SQL Server.

Copy command:-

CREATE TABLE [dbo].[Trip] (

[DateID] int NOT NULL,

[MedallionID] int NOT NULL,

[HackneyLicenseID] int NOT NULL,

[PickupTimeID] int NOT NULL,

[DropoffTimeID] int NOT NULL,

[PickupGeographyID] int NULL,

[DropoffGeographyID] int NULL,

[PickupLatitude] float NULL,

[PickupLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[DropoffLatitude] float NULL,

[DropoffLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,

[PassengerCount] int NULL,

[TripDurationSeconds] int NULL,

[TripDistanceMiles] float NULL,
[PaymentType] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[FareAmount] money NULL,
[SurchargeAmount] money NULL,
[TaxAmount] money NULL,

[TipAmount] money NULL,

[TollsAmount] money NULL,

[TotalAmount] money NULL

) WITH ( DISTRIBUTION = ROUND_ROBIN,

CLUSTERED COLUMNSTORE INDEX

);

COPY INTO Trip FROM 'https://nytaxiblob.blob.core.windows.net/2013/Trip2013/QID6392_20171107_05910_0.txt.gz' WITH ( FILE_TYPE = 'CSV',

FIELDTERMINATOR = '|',

FIELDQUOTE = '',

ROWTERMINATOR='0X0A',

COMPRESSION = 'GZIP'

) OPTION (LABEL = 'COPY : Load');

Upvotes: 0

GregGalloway
GregGalloway

Reputation: 11625

The two options labeled “Polybase” and the “COPY command” are only applicable to Azure Synapse Analytics (formerly Azure SQL Data Warehouse). They are both fast methods of loading which involve staging data in Azure storage (if it’s not already in Azure Storage) and using a fast, highly parallel method of loading to each compute node from storage. Especially on large tables these options are preferred due to their scalability but they do come with some restrictions documented at the link above.

In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.

On other Azure SQL databases, bulk insert is the preferred and fast method.

Upvotes: 9

Related Questions