Reputation: 2882
Here are the details of this performance test (very simple). I'm trying to understand why running data flows in the cloud native Azure Data Factory environment (Spark) is so much slower than running data flows hosted in Azure SSIS IR. My results show that running in latest ADFv2 is over 4 times slower than running the exact same data flow in Azure SSIS (even with a warm IR cluster already warmed up from previous run). I like all the new features of the v2 data flows but it hardly seems worth the performance hit unless I'm completely missing something. Eventually I'll be adding more complex data flows but wanted to understand base performance behavior.
Source: 1GB CSV stored in blob storage
Destination: Azure SQL Server Database (one table and truncated before each run)
When using control flow in ADFv2 using a simple CopyActivity (no data flow)
91 seconds
When using native SSIS package with data flow (Azure Feature Pack to pull from same blob storage) running Azure SSIS with 8 cores.
76 seconds
Pure ADF Cloud Pipeline using DataFlow with warm Azure IR (cached from previous run) 8 (+ 8 Driver cores) with default partitioning (Spark) (includes 96 seconds cluster startup which is another thing I don't understand since the TTL is 30 minutes on the IR and it was just ran 10 minutes prior)
360 seconds
Pipeline (LandWithCopy)
{ "name": "LandWithCopy", "properties": { "activities": [ { "name": "CopyData", "type": "Copy", "dependsOn": [], "policy": { "timeout": "7.00:00:00", "retry": 0, "retryIntervalInSeconds": 30, "secureOutput": false, "secureInput": false }, "userProperties": [], "typeProperties": { "source": { "type": "DelimitedTextSource", "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true, "wildcardFileName": "data.csv", "enablePartitionDiscovery": false }, "formatSettings": { "type": "DelimitedTextReadSettings" } }, "sink": { "type": "AzureSqlSink", "preCopyScript": "TRUNCATE TABLE PatientAR", "disableMetricsCollection": false }, "enableStaging": false, "translator": { "type": "TabularTranslator", "mappings": [ { "source": { "name": "RecordAction", "type": "String" }, "sink": { "name": "RecordAction", "type": "String" } }, { "source": { "name": "UniqueId", "type": "String" }, "sink": { "name": "UniqueId", "type": "String" } }, { "source": { "name": "Type", "type": "String" }, "sink": { "name": "Type", "type": "String" } }, { "source": { "name": "TypeDescription", "type": "String" }, "sink": { "name": "TypeDescription", "type": "String" } }, { "source": { "name": "PatientId", "type": "String" }, "sink": { "name": "PatientId", "type": "String" } }, { "source": { "name": "PatientVisitId", "type": "String" }, "sink": { "name": "PatientVisitId", "type": "String" } }, { "source": { "name": "VisitDateOfService", "type": "String" }, "sink": { "name": "VisitDateOfService", "type": "String" } }, { "source": { "name": "VisitDateOfEntry", "type": "String" }, "sink": { "name": "VisitDateOfEntry", "type": "String" } }, { "source": { "name": "DoctorId", "type": "String" }, "sink": { "name": "DoctorId", "type": "String" } }, { "source": { "name": "DoctorName", "type": "String" }, "sink": { "name": "DoctorName", "type": "String" } }, { "source": { "name": "FacilityId", "type": "String" }, "sink": { "name": "FacilityId", "type": "String" } }, { "source": { "name": "FacilityName", "type": "String" }, "sink": { "name": "FacilityName", "type": "String" } }, { "source": { "name": "CompanyName", "type": "String" }, "sink": { "name": "CompanyName", "type": "String" } }, { "source": { "name": "TicketNumber", "type": "String" }, "sink": { "name": "TicketNumber", "type": "String" } }, { "source": { "name": "TransactionDateOfEntry", "type": "String" }, "sink": { "name": "TransactionDateOfEntry", "type": "String" } }, { "source": { "name": "InternalCode", "type": "String" }, "sink": { "name": "InternalCode", "type": "String" } }, { "source": { "name": "ExternalCode", "type": "String" }, "sink": { "name": "ExternalCode", "type": "String" } }, { "source": { "name": "Description", "type": "String" }, "sink": { "name": "Description", "type": "String" } }, { "source": { "name": "Fee", "type": "String" }, "sink": { "name": "Fee", "type": "String" } }, { "source": { "name": "Units", "type": "String" }, "sink": { "name": "Units", "type": "String" } }, { "source": { "name": "AREffect", "type": "String" }, "sink": { "name": "AREffect", "type": "String" } }, { "source": { "name": "Action", "type": "String" }, "sink": { "name": "Action", "type": "String" } }, { "source": { "name": "InsuranceGroup", "type": "String" }, "sink": { "name": "InsuranceGroup", "type": "String" } }, { "source": { "name": "Payer", "type": "String" }, "sink": { "name": "Payer", "type": "String" } }, { "source": { "name": "PayerType", "type": "String" }, "sink": { "name": "PayerType", "type": "String" } }, { "source": { "name": "PatBalance", "type": "String" }, "sink": { "name": "PatBalance", "type": "String" } }, { "source": { "name": "InsBalance", "type": "String" }, "sink": { "name": "InsBalance", "type": "String" } }, { "source": { "name": "Charges", "type": "String" }, "sink": { "name": "Charges", "type": "String" } }, { "source": { "name": "Payments", "type": "String" }, "sink": { "name": "Payments", "type": "String" } }, { "source": { "name": "Adjustments", "type": "String" }, "sink": { "name": "Adjustments", "type": "String" } }, { "source": { "name": "TransferAmount", "type": "String" }, "sink": { "name": "TransferAmount", "type": "String" } }, { "source": { "name": "FiledAmount", "type": "String" }, "sink": { "name": "FiledAmount", "type": "String" } }, { "source": { "name": "CheckNumber", "type": "String" }, "sink": { "name": "CheckNumber", "type": "String" } }, { "source": { "name": "CheckDate", "type": "String" }, "sink": { "name": "CheckDate", "type": "String" } }, { "source": { "name": "Created", "type": "String" }, "sink": { "name": "Created", "type": "String" } }, { "source": { "name": "ClientTag", "type": "String" }, "sink": { "name": "ClientTag", "type": "String" } } ] } }, "inputs": [ { "referenceName": "PAR_Source_DS", "type": "DatasetReference" } ], "outputs": [ { "referenceName": "PAR_Sink_DS", "type": "DatasetReference" } ] } ], "annotations": [] } }
Pipeline Data Flow (LandWithFlow) { "name": "WriteData", "properties": { "type": "MappingDataFlow", "typeProperties": { "sources": [ { "dataset": { "referenceName": "PAR_Source_DS", "type": "DatasetReference" }, "name": "GetData" } ], "sinks": [ { "dataset": { "referenceName": "PAR_Sink_DS", "type": "DatasetReference" }, "name": "WriteData" } ], "transformations": [], "script": "source(output(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\twildcardPaths:['data.csv']) ~> GetData\nGetData sink(input(\n\t\tRecordAction as string,\n\t\tUniqueId as string,\n\t\tType as string,\n\t\tTypeDescription as string,\n\t\tPatientId as string,\n\t\tPatientVisitId as string,\n\t\tVisitDateOfService as string,\n\t\tVisitDateOfEntry as string,\n\t\tDoctorId as string,\n\t\tDoctorName as string,\n\t\tFacilityId as string,\n\t\tFacilityName as string,\n\t\tCompanyName as string,\n\t\tTicketNumber as string,\n\t\tTransactionDateOfEntry as string,\n\t\tInternalCode as string,\n\t\tExternalCode as string,\n\t\tDescription as string,\n\t\tFee as string,\n\t\tUnits as string,\n\t\tAREffect as string,\n\t\tAction as string,\n\t\tInsuranceGroup as string,\n\t\tPayer as string,\n\t\tPayerType as string,\n\t\tPatBalance as string,\n\t\tInsBalance as string,\n\t\tCharges as string,\n\t\tPayments as string,\n\t\tAdjustments as string,\n\t\tTransferAmount as string,\n\t\tFiledAmount as string,\n\t\tCheckNumber as string,\n\t\tCheckDate as string,\n\t\tCreated as string,\n\t\tClientTag as string,\n\t\tFileName as string,\n\t\tPractice as string\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tdeletable:false,\n\tinsertable:true,\n\tupdateable:false,\n\tupsertable:false,\n\tformat: 'table',\n\tpreSQLs:['TRUNCATE TABLE PatientAR'],\n\tmapColumn(\n\t\tRecordAction,\n\t\tUniqueId,\n\t\tType,\n\t\tTypeDescription,\n\t\tPatientId,\n\t\tPatientVisitId,\n\t\tVisitDateOfService,\n\t\tVisitDateOfEntry,\n\t\tDoctorId,\n\t\tDoctorName,\n\t\tFacilityId,\n\t\tFacilityName,\n\t\tCompanyName,\n\t\tTicketNumber,\n\t\tTransactionDateOfEntry,\n\t\tInternalCode,\n\t\tExternalCode,\n\t\tDescription,\n\t\tFee,\n\t\tUnits,\n\t\tAREffect,\n\t\tAction,\n\t\tInsuranceGroup,\n\t\tPayer,\n\t\tPayerType,\n\t\tPatBalance,\n\t\tInsBalance,\n\t\tCharges,\n\t\tPayments,\n\t\tAdjustments,\n\t\tTransferAmount,\n\t\tFiledAmount,\n\t\tCheckNumber,\n\t\tCheckDate,\n\t\tCreated,\n\t\tClientTag\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> WriteData" } } }
Upvotes: 1
Views: 1630
Reputation: 45
We are having the Same issues. Copy Activity without Data Flow is much faster than Data Flow. Our case is Copy Activity vs Data Flow. Not Sure if I'm doing anything wrong.
Our Scenario is just copy from Source to Destination 13 tables based on where Clause. We now have two copy activity which takes 1.5 minutes. So I was thinking may be create Data Flow and do one Source two Sinks. But it's running like 5 minutes to 8 Minutes depending on Cluster startup time. Hope we get an answer.
Upvotes: 0