Steve

Reputation: 59

Data factory azure blob source - wildcard

I was led to believe that you can wildcard the filename property in an Azure Blob Table source object.

I want to pick up only certain csv files from blob storage that exist in the same directory as other files I don't want to process:

i.e.

root/data/GUJH-01.csv

root/data/GUJH-02.csv

root/data/DFGT-01.csv

I want to process GUJH*.csv and not DFGT-01.csv

Is this possible? If so, why is my blob source validation failing, telling me the file does not exist? (The error reports that the root/data blob does not exist.)

Thanks in advance.

Upvotes: 2

Views: 3369

Answers (2)

Nick.Mc

Reputation: 19215

Just adding some more detail here because I'm finding this a very difficult learning curve and I'd like to document this for my sake and others.

Given a sample file like this (no extensions in this case) in blob storage,

ZZZZ_20170727_1324

We can see the middle part is in yyyyMMdd format.

This is uploaded to folder Landing inside container MyContainer

This was part of my dataset definition:

    "typeProperties": {
        "folderPath": "MyContainer/Landing/ZZZZ_{DayCode}",
        "format": {
            "type": "TextFormat",
            "columnDelimiter": "\u0001"
        },
        "partitionedBy": [
            {
                "name": "DayCode",
                "value": {
                    "type": "DateTime",
                    "date": "SliceStart",
                    "format": "yyyyMMdd"
                }
            }
        ]
    },

Note that it's a 'prefix', which you will see in the log / error messages, if you can find them (good luck).
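To illustrate how the prefix resolves (using the sample file above, and a slice starting on 2017-07-27), the partitionedBy entry substitutes {DayCode} like this:

```
folderPath template:            MyContainer/Landing/ZZZZ_{DayCode}
DayCode (SliceStart, yyyyMMdd): 20170727
resolved blob prefix:           MyContainer/Landing/ZZZZ_20170727
```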

If you want to test loading this particular file, press the 'Diagram' button, then drill into your pipeline until you find the target dataset - the one the file is being loaded into (I am loading this into SQL Azure). Click on the target dataset, then go and find the correct timeslice. In my case I need to find the timeslice with a start of 20170727 and run that one.

This will make sure the correct file is picked up and loaded into SQL Azure.

Forget about manually running pipelines or activities - that's just not how it works. You need to run the output dataset under a timeslice to pull it through.

Upvotes: 0

Steve

Reputation: 59

Answering my own question:

There's no wildcard, but there is a 'Starts With' (prefix) behaviour which will work in my scenario:

Instead of root/data/GUJH*.csv I can put root/data/GUJH in the folderPath property and it will bring in all root/data/GUJH files.
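As a sketch, an input dataset using that prefix might look like the following (the dataset name, linked service name, and availability settings here are made up for illustration; the relevant part is folderPath):

```json
{
    "name": "GujhCsvInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "MyBlobLinkedService",
        "typeProperties": {
            "folderPath": "root/data/GUJH",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```

Because the path is treated as a prefix, this picks up GUJH-01.csv and GUJH-02.csv but not DFGT-01.csv.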

:)

Upvotes: 3
