Brett Law

Reputation: 68

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a Blob storage container, and these files feed a Mapping Data Flow (MDF). The individual files written by the third-party tool work fine. If I copy these files to a different Blob folder using an Azure Copy Data activity, the MDF can no longer parse them and gives an error: "JSON parsing error, unsupported encoding or multiline." I started with the Merge Files copy behavior, but the outcome is the same regardless of which copy behavior I choose.

2ND EDIT: After another day's work, I have found that the Copy Activity with Merge Files, going from JSON to JSON, definitely adds an EOL character after each JSON object as it is written to the merged file. I have also found that the MDF definitely fails when those EOL characters are in the merged file. If I remove all EOL characters from the merged file, the same MDF works. To me, this is a bug: the copy activity is adding a character that breaks the MDF. There seems to be a second issue in some of my data, which does not fail as individual files but does break the MDF once the files are concatenated. Even so, I have tested the basic behavior on 1 to 5000 files and have been able to repeat the fail/success tests.

I took the original file and the copied file and ran them through all sorts of tests. This is what I eventually found when I opened them in Notepad++.

Copied file:

{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n

Original file:

{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n

If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
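For anyone who wants to check the line endings without Notepad++, here is a minimal Python sketch that counts them in raw bytes and rewrites CRLF as LF. The function names and the sample bytes are hypothetical and only for illustration. This runs outside ADF and does not change how the Copy Data activity writes files.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR sequences in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def normalize_to_lf(data: bytes) -> bytes:
    """Rewrite every CRLF as a single LF; all other bytes are unchanged."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    # Stand-in for the bytes of the file written by the copy (hypothetical).
    copied = b'{"ID":"123456"}\r\n'
    print(count_line_endings(copied))
    print(count_line_endings(normalize_to_lf(copied)))
```

Working on bytes rather than decoded text avoids any newline translation that a text-mode file read might apply.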

EDIT: NEW INFORMATION -- On further review, it seems like the minification/whitespace removal may be the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF change masked something else. I'm not sure what to do at this point, but it's super frustrating. Other possibly relevant information:

EDITED for clarity based on comments: In the case of a single JSON element in a file, I can get this to work. Data preview returns the same success or failure as the pipeline run. (Screenshot: single JSON case)

In the case of multiple documents merged by ADF, I get the below instead. (Screenshot: multiple JSON case)

Repro: Create any valid JSON as a single file, put it in Blob storage, and use it as a source in a Mapping Data Flow with any sink operation. Create a second file with the same schema, and get both to run in the same flow using wildcard paths. Then use a Copy Activity with Merge Files as the sink copy behavior and Array of objects as the File pattern. Try to make your MDF use this new merged file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from the standard VS Code JSON extension, and the VS 2019 "Unminify" command) and re-upload it. It should work now.
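The download, format, and re-upload step can be approximated locally with a short Python sketch. It assumes the merged file is one strictly valid JSON document (for example an array of objects). The function name and the second sample record are hypothetical. Its output is not guaranteed to match what the VS Code or VS 2019 formatters produce byte for byte.

```python
import json


def reformat_json_text(text: str, indent: int = 2) -> str:
    """Parse a JSON document and re-serialize it with indentation.

    Raises json.JSONDecodeError if the text is not strictly valid JSON.
    The result uses LF line endings and ends with a single newline.
    """
    parsed = json.loads(text)
    return json.dumps(parsed, indent=indent, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    # Hypothetical minified merge output with a trailing CRLF.
    merged = (
        '[{"ID":"123456","name":"Customer Name"},'
        '{"ID":"654321","name":"Other Customer"}]\r\n'
    )
    print(reformat_json_text(merged))
```

If `json.loads` raises here, the merged file is not valid JSON to begin with, which would point at the content rather than the whitespace.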

Upvotes: 0

Views: 2614

Answers (2)

Hao Zhang

Reputation: 11

I don't know if you have already solved the problem, but I came across the exact same problem 3 days ago, and after several tries I found a solution:

  1. In the Copy Data activity, under Sink settings, use "Set of objects" (instead of "Array of objects") as the File pattern, so that the merged JSON file contains each original small JSON file written on its own line.
  2. In the MDF, after setting up the wildcard paths with the *.json pattern, under JSON settings select "Document per line" as the Document form.

After that you should be good to go; at least it solved my problem. The CRLF written automatically with the "Array of objects" setting in the Copy Data activity is presumably default behavior, and MSFT should provide an option to omit it in the settings in the future.
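To make the "one document per line" layout concrete, here is a minimal Python sketch that writes and reads such a file. It only illustrates the file shape described in the steps above and is not how ADF implements it. The function names and sample records are hypothetical.

```python
import json


def to_document_per_line(documents: list) -> str:
    """Serialize each document as compact JSON on its own LF-terminated line."""
    return "".join(
        json.dumps(doc, separators=(",", ":")) + "\n" for doc in documents
    )


def parse_document_per_line(text: str) -> list:
    """Parse one JSON document per line, accepting LF or CRLF line endings."""
    lines = text.replace("\r\n", "\n").split("\n")
    # Skip empty lines, such as the one after the final newline.
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    docs = [{"ID": "123456"}, {"ID": "654321"}]  # hypothetical records
    merged = to_document_per_line(docs)
    print(merged, end="")
    print(parse_document_per_line(merged) == docs)
```

Because every line is a complete document, a reader that works line by line never has to parse across a line break, whichever line ending the writer used.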

Upvotes: 1

Steve Johnson

Reputation: 8660

According to my test:

1. The Copy Data activity doesn't change Unix (LF) line endings to Windows (CRLF).

2. The MDF can parse both Unix (LF) files and Windows (CRLF) files.

Maybe there is something else wrong. By the way, I see there is a comma after "name":"Customer Name" in your original file; I deleted it before my test.
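The trailing comma is easy to check with a strict parser. A minimal sketch using Python's standard json module, with a hypothetical helper name, shows that the sample as posted is rejected, while the same document without the comma is accepted with or without a trailing CRLF. This only describes Python's parser and says nothing about how the MDF parses the file.

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True if text parses as JSON under Python's strict parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    with_comma = (
        '{"CustomerMasterData":{"Customer":'
        '[{"ID":"123456","name":"Customer Name",}]}}'
    )
    without_comma = (
        '{"CustomerMasterData":{"Customer":'
        '[{"ID":"123456","name":"Customer Name"}]}}'
    )
    print(is_strict_json(with_comma))
    print(is_strict_json(without_comma))
    print(is_strict_json(without_comma + "\r\n"))
```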

Upvotes: 0
