robert_553z8

Renaming files in a nested directory with Azure Data Factory

I have a daily export set up for several subscriptions. The files export like so:

[screenshot: exported files]

There are 7 different directories within daily. I'm simply trying to rename the files to remove the underscore so they work with data flows.

My parent pipeline looks like this:

[screenshot: parent pipeline]

A Get Metadata activity gets the folder names, and a ForEach invokes the child pipeline for each folder:

[screenshot: ForEach invoking the child pipeline]
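For reference, an Execute Pipeline call inside a ForEach that is driven by Get Metadata child items looks roughly like this in the underlying JSON (a sketch only; the 'Get Folder Names' activity name and the folderName parameter are placeholders, not my exact setup):

{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('Get Folder Names').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Execute child_Rename",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "child_Rename", "type": "PipelineReference" },
                    "parameters": { "folderName": "@item().name" }
                }
            }
        ]
    }
}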

Here are the screen grabs of the child pipeline:

[screenshot: child pipeline]

The Copy Data activity within ForEach1 -- the source:

[screenshot: copy data source]

And now the sink. This is where I want to rename the file. The first time I debugged, it simply copied the files to the correct place but with a .txt extension; the next time it got the extension right, but it is not renaming the file.

[screenshot: copy data sink]

I replaced @replace(item().name, '_', '-') with @replace(activity('FileInfo').output.itemName, '_','-') and got the following error:

The expression '@replace(activity('FileInfo').output.itemName, '_','-')' cannot be evaluated because property 'itemName' doesn't exist, available properties are 'childItems, effectiveIntegrationRuntime, executionDuration, durationInQueue, billingReference'.

So then I replaced that with @replace(activity('FileInfo').output.childItems, '_', '-'), but that gives the following error: Cannot fit childItems return type into the function parameter string.

I'm not sure where to go from here.
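From the two errors, it looks like the Get Metadata output has no itemName property and childItems is an array of { name, type } objects rather than a string, so (as far as I understand) it has to be iterated with a ForEach whose items are bound to that output, roughly like this fragment:

"items": {
    "value": "@activity('FileInfo').output.childItems",
    "type": "Expression"
}

and then the per-item expression inside the loop would be @replace(item().name, '_', '-').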

Edit 7/14: making the change from the answer below.

Here is my linked service for the sink dataset with the parameter renamedFile:

[screenshot: sink dataset linked service]

[screenshot: renamedFile parameter]
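In the dataset JSON, the parameter shows up roughly like this (a sketch of just the relevant fragment):

"parameters": {
    "renamedFile": {
        "type": "string"
    }
}

As far as I understand, the parameter only has an effect if it is also referenced in the dataset's file name field via dynamic content, e.g. @dataset().renamedFile.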

Here is the sink on Copy Data1 for the child_Rename pipeline; it grayed out the file extension, as mentioned in the answer:

[screenshot: copy data sink settings]

Now here is the sink container after running the pipeline:

[screenshot: sink container]

This is the directory structure of the source data; it's dynamically created by the scheduled daily Azure exports:

[screenshot: source directory structure]

Here is the output of the Get Metadata activity (FileInfo) from the child pipeline:

{
    "childItems": [
        {
            "name": "daily",
            "type": "Folder"
        }
    ],
    "effectiveIntegrationRuntime": "integrationRuntime1 (Central US)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 0
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}
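So the Child Items field only lists the immediate children of the folder the dataset points to, which here is just the daily folder. For comparison, if the dataset pointed at one of the month folders that contain the CSVs, the entries would look more like this (same output shape, file names taken from the structure below), which is the kind of name I actually want to feed into replace():

"childItems": [
    { "name": "this_file.csv", "type": "File" },
    { "name": "this_file1.csv", "type": "File" }
]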

allsubs -- the source container
daily -- directory created by the scheduled export
sub1 - subN -- the different subs with scheduled exports
previous-month -> this-month -- monthly folders are created automatically
this_fileXX.csv -- files are automatically generated with the underscore in the name; it is my understanding that data flows cannot handle this character in the file name

allsubs/
└── daily/
    ├── sub1/
    │   ├── previous-month/
    │   │   ├── this_file.csv
    │   │   └── this_file1.csv
    │   ├── previous-month/
    │   │   ├── this_file11.csv
    │   │   └── this_file12.csv
    │   └── this-month/
    └── subN/
        ├── previous-month/
        ├── previous-month/
        └── this-month/
            └── this_fileXX.csv
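Given that structure, a wildcard source that reaches the CSVs has to span the sub and month folder levels. A hypothetical example of the copy source store settings (the actual wildcard I used is in the screenshots in edit 2 below; the paths here are only to illustrate the nesting):

"storeSettings": {
    "type": "AzureBlobFSReadSettings",
    "recursive": true,
    "wildcardFolderPath": "daily/*/*",
    "wildcardFileName": "*_*.csv"
}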

Edit 2 - July 20

I think I'm getting closer, but there are still some small errors I do not see.

The pipeline now moves all the files from the container allsubs to the container renamed-files, but it is not renaming the files. It looks like so:

Get Metadata - from the dataset allContainers, it retrieves the folders with the Child Items field.

Dataset allContainers shown (preview works, linked service works, no parameters in this dataset): [screenshot: allContainers dataset]

[screenshot]

Next, the ForEach activity uses the output of Get Metadata for its items: @activity('Get Metadata1').output.childItems

[screenshot: ForEach settings]
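A sketch of what those two activities amount to in the pipeline JSON (only the relevant properties shown; the ForEach body with the copy is omitted):

{
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "allContainers", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
    }
},
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        }
    }
}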

Next is the Copy Data activity within the ForEach.

The source is the allContainers dataset with the wildcard file path selected and Recursively checked; because of the following error, max concurrent connections is set to 1, but this did not resolve the error.

Error message:

Failure happened on 'Sink' side.
ErrorCode=AzureStorageOperationFailedConcurrentWrite,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error occurred when trying to upload a file.
It's possible because you have multiple concurrent copy activities
runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'.
Check your ADF configuration.
,Source=Microsoft.DataTransfer.ClientLibrary,
''Type=Microsoft.WindowsAzure.Storage.StorageException,
Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,
StorageExtendedMessage=The specified block list is invalid.
RequestId:b519219f-601e-000d-6c4c-9c9c5e000000
Time:2022-07-20T15:23:51.4342693Z,,
''Type=System.Net.WebException,
Message=The remote server returned an error: (400) Bad Request.,
Source=Microsoft.WindowsAzure.Storage,'

Copy Data source: [screenshot]

Copy Data sink: the dataset is dsRenamesink; it's simply another container in a different storage account. The linked service is set up correctly, and the dataset has the parameter renamedFile, but I suspect this is the source of my error. Still testing that.

Sink dataset dsRenamesink: [screenshot]

Parameter page: [screenshot]

Here's the sink in the Copy Data activity, where the renamed file is passed the iterator from ForEach1 like so:

@replace(item().name,'_','renameworked') -- so the underscore would be replaced with 'renameworked', easy enough to test.

[screenshot: copy data sink]
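In JSON terms, the sink side of the copy is passing that value into the dataset parameter roughly like this (a sketch of just the output reference):

"outputs": [
    {
        "referenceName": "dsRenamesink",
        "type": "DatasetReference",
        "parameters": {
            "renamedFile": "@replace(item().name,'_','renameworked')"
        }
    }
]

As far as I can tell, this only renames anything if dsRenamesink itself uses @dataset().renamedFile in its file name field.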

Debugging the pipeline:

The errors look to be consistent across the 7 failures, which were shown above as 'Failure happened on Sink side'.

However, going into the sink storage account, I can see that all of the files from the source were copied over, but the files were not renamed: [screenshot: sink container contents]

Pipeline output: [screenshot]

Error messages:

{
    "dataRead": 28901858,
    "dataWritten": 10006989,
    "filesRead": 4,
    "filesWritten": 0,
    "sourcePeakConnections": 1,
    "sinkPeakConnections": 1,
    "copyDuration": 7,
    "throughput": 4032.067,
    "errors": [
        {
            "Code": 24107,
            "Message": "Failure happened on 'Sink' side. ErrorCode=AzureStorageOperationFailedConcurrentWrite,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error occurred when trying to upload a file. It's possible because you have multiple concurrent copy activities runs writing to the same file 'renamed-files/rlcosts51122/20220601-20220630/rlcosts51122_082dd29b-95b2-4da5-802a-935d762e89d8.csv'. Check your ADF configuration.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.WindowsAzure.Storage.StorageException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,StorageExtendedMessage=The specified block list is invalid.\nRequestId:b519219f-601e-000d-6c4c-9c9c5e000000\nTime:2022-07-20T15:23:51.4342693Z,,''Type=System.Net.WebException,Message=The remote server returned an error: (400) Bad Request.,Source=Microsoft.WindowsAzure.Storage,'",
            "EventType": 0,
            "Category": 5,
            "Data": {
                "FailureInitiator": "Sink"
            },
            "MsgId": null,
            "ExceptionType": null,
            "Source": null,
            "StackTrace": null,
            "InnerEventInfos": []
        }
    ],
    "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (Central US)",
    "usedDataIntegrationUnits": 4,
    "billingReference": {
        "activityType": "DataMovement",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.06666666666666667,
                "unit": "DIUHours"
            }
        ]
    },
    "usedParallelCopies": 1,
    "executionDetails": [
        {
            "source": {
                "type": "AzureBlobFS",
                "region": "Central US"
            },
            "sink": {
                "type": "AzureBlobStorage"
            },
            "status": "Failed",
            "start": "Jul 20, 2022, 10:23:44 am",
            "duration": 7,
            "usedDataIntegrationUnits": 4,
            "usedParallelCopies": 1,
            "profile": {
                "queue": {
                    "status": "Completed",
                    "duration": 3
                },
                "transfer": {
                    "status": "Completed",
                    "duration": 2,
                    "details": {
                        "listingSource": {
                            "type": "AzureBlobFS",
                            "workingDuration": 0
                        },
                        "readingFromSource": {
                            "type": "AzureBlobFS",
                            "workingDuration": 0
                        },
                        "writingToSink": {
                            "type": "AzureBlobStorage",
                            "workingDuration": 0
                        }
                    }
                }
            },
            "detailedDurations": {
                "queuingDuration": 3,
                "transferDuration": 2
            }
        }
    ],
    "dataConsistencyVerification": {
        "VerificationResult": "NotVerified"
    },
    "durationInQueue": {
        "integrationRuntimeQueue": 0
    }
}

All I wanted to do was remove the underscore from the file name so it works with data flows. I'm not sure what else to try next.

Next attempt - July 20

It appears that now I have been able to copy and rename some of the files.

I changed the sink dataset as follows: @concat(replace(dataset().renamedFile,'_','-'),'',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.csv') [screenshot]

[screenshot: sink dataset file name]
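So the file name of the sink dataset now carries that dynamic content, roughly like this fragment of the dataset JSON (container name from the sink container above; the rest is a sketch):

"typeProperties": {
    "location": {
        "type": "AzureBlobStorageLocation",
        "container": "renamed-files",
        "fileName": {
            "value": "@concat(replace(dataset().renamedFile,'_','-'),'',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.csv')",
            "type": "Expression"
        }
    }
}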

I also removed this parameter from the sink in the copy activity:

[screenshot: copy activity sink]

Upon debugging this pipeline, I get 1 file in the sink and it is named correctly, but there is still something wrong: [screenshot]

[screenshot: sink container]

Third attempt - 7/20

Further updates to get closer to the original answer. The sink dataset:

[screenshot: sink dataset]

Copy Data activity sink - the concat works: [screenshot]

Now, after debugging, I'm left with 1 file for each of the subs, so there is something still not quite correct.

[screenshot: sink container contents]

Answers (1)

Vamsi Bitra

I reproduced the same thing in my environment.

Go to the sink dataset and open it. First create a parameter, then add dynamic content for the file name; I used this expression: @dataset().sinkfilename

Ref1

Ref2

In the copy activity sink, under dataset properties, pass the filename value using the expression @replace(item().name,'_','-') to replace _ with -.

When you create a dataset parameter to pass the filename, the File extension property is automatically disabled.
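Put together, the two pieces look roughly like this in the JSON (a sketch; only the relevant fragments shown). In the sink dataset, the file name is just the parameter:

"fileName": {
    "value": "@dataset().sinkfilename",
    "type": "Expression"
}

and in the copy activity sink, the dataset reference supplies that parameter:

"parameters": {
    "sinkfilename": "@replace(item().name,'_','-')"
}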

Ref3

When the pipeline runs, you can see the file name has been renamed accordingly.

Ref5
