szarghani

Reputation: 163

Use Azure Data Factory (without MapReduce) to import and unzip archives to a Blob Storage

I have to import various zip files from an FTP server into some Azure SQL tables. Every zip file contains up to 10 text (CSV) files with different structures, e.g. the rows in the first text file are of the form

"1|Monday|2017-03-20|345671"

while the rows in the second text file are of the form, say,

"abc|345894|xyz||2|yyy|true|3".

I didn't want to use MapReduce (or custom activities), as it is expensive and slow (Microsoft Azure Support suggested using an on-demand HDInsight MapReduce activity to unzip the files first).

Upvotes: 2

Views: 646

Answers (1)

Paul Andrew

Reputation: 3253

Define the necessary Linked Service, in my case an FTP Linked Service, and then use a Dataset of type FileShare to get the files from the source (the FTP server). In this Dataset, specify that the files are compressed (a rough sketch of the FTP Linked Service itself follows the Dataset below):

{
    "name": "myFTPFileInput",
    "properties": {
        "published": false,
        "type": "FileShare",
        "linkedServiceName": "myFTPLinkedService",
        "typeProperties": {
            "fileName": "xyz20170316.zip",
            "useBinaryTransfer": "true",
            "folderPath": "/Products/xyzProducts",
            "compression": {
                "type": "ZipDeflate",
                "level": "Optimal"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 15
        },
        "external": true,
        "policy": {}
    }
}
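
For reference, the myFTPLinkedService referenced by the Dataset above could be sketched roughly as follows; the host, port and credentials are placeholders and Basic authentication is only an assumption:

{
    "name": "myFTPLinkedService",
    "properties": {
        "type": "FtpServer",
        "typeProperties": {
            "authenticationType": "Basic",
            "host": "<your-ftp-host>",
            "port": 21,
            "username": "<user>",
            "password": "<password>",
            "enableSsl": false
        }
    }
}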

Use a BlobSink to write the files into Blob Storage:

{
    "name": "myAzureBlobOutput",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "myAzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mytest/ftp/xyz/{Year}/{Month}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "|"
            },
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Day",
            "interval": 15
        }
    }
}
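
To wire the two datasets together, a pipeline with a single Copy Activity along these lines should be enough. This is only a sketch: the pipeline and activity names, schedule and active period are illustrative; FileSystemSource reads the zipped file from the FTP Dataset and BlobSink writes the decompressed text to the Blob Dataset:

{
    "name": "myFtpToBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "FtpToBlobCopy",
                "type": "Copy",
                "inputs": [ { "name": "myFTPFileInput" } ],
                "outputs": [ { "name": "myAzureBlobOutput" } ],
                "typeProperties": {
                    "source": { "type": "FileSystemSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 15
                }
            }
        ],
        "start": "2017-03-16T00:00:00Z",
        "end": "2017-12-31T00:00:00Z"
    }
}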

The data will be decompressed and written to the specified folder as text. From there you can use a standard ADF Copy Activity to import each file into the corresponding Azure SQL table.
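
That second copy could be sketched per target table as the activity below; myDecompressedTextInput and myAzureSqlTableOutput are hypothetical Dataset names (an AzureBlob Dataset over the decompressed text files and an AzureSqlTable Dataset, respectively) that you would define yourself, one pair per file type since each text file has a different column structure:

{
    "name": "BlobToAzureSqlCopy",
    "type": "Copy",
    "inputs": [ { "name": "myDecompressedTextInput" } ],
    "outputs": [ { "name": "myAzureSqlTableOutput" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    },
    "scheduler": {
        "frequency": "Day",
        "interval": 15
    }
}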

Hope this helps

Upvotes: 1
