Reputation: 4008
I'm seeing a significant discrepancy in Data Flow results when using a Cache Sink vs a Data Set Sink. I recreated a simple example to demonstrate.
I uploaded a simple JSON file to Azure Data Lake Storage Gen 2:
{
  "data": [
    {
      "id": 123,
      "name": "ABC"
    },
    {
      "id": 456,
      "name": "DEF"
    },
    {
      "id": 789,
      "name": "GHI"
    }
  ]
}
I created a simple Data Flow that loads this JSON file, flattens it out, then returns it via a Sink. I'm primarily interested in using a Cache Sink because the output is small and I will ultimately need the output for the next pipeline step. (Write to activity output is checked.)
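For context, the plan for the next pipeline step is to read the cached output from the Data Flow activity's `runStatus` with a pipeline expression. A rough sketch of a follow-up Set Variable activity (the activity name `Data flow1` and variable name `firstId` are just placeholders for this example):

```json
{
  "name": "Set variable1",
  "type": "SetVariable",
  "dependsOn": [
    { "activity": "Data flow1", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "variableName": "firstId",
    "value": "@string(activity('Data flow1').output.runStatus.output.TestCacheSink.value[0].id)"
  }
}
```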
You can see that the Data Preview shows all 3 rows. (I have two sinks in this example simply because I'm illustrating that these do not match.)
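For reference, the data flow script behind this (from the script view) looks roughly like the following; this is a sketch rather than my exact script, and the transformation names and option lists may differ slightly:

```
source(output(
        data as (id as integer, name as string)[]
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 foldDown(unroll(data),
    mapColumn(
        id = data.id,
        name = data.name
    )) ~> Flatten1
Flatten1 sink(allowSchemaDrift: true,
    validateSchema: false,
    store: 'cache',
    format: 'inline',
    output: true) ~> TestCacheSink
Flatten1 sink(allowSchemaDrift: true,
    validateSchema: false) ~> TestDataSetSink
```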
Next, I create a pipeline to run the data flow:
Now, when I debug it, the Data Flow output only shows 1 record:
"output": {
"TestCacheSink": {
"value": [
{
"id": 123,
"name": "ABC"
}
],
"count": 1
}
},
However, the second Data Set Sink contains all 3 records:
{"id":123,"name":"ABC"}
{"id":456,"name":"DEF"}
{"id":789,"name":"GHI"}
I expect that the output from the Cache Sink would also have 3 records. Why is there a discrepancy?
Upvotes: 1
Views: 1743
Reputation: 5044
When you choose Cache as a sink, logging is not allowed, and you will see the error below during validation before debugging.
The catch is that when you select "None" for logging, the "First row only" property is automatically checked, which is what causes only the first row to be written to the cache sink. You just have to manually uncheck it before running the debug.
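In the pipeline JSON, the Logging level corresponds to the Data Flow activity's `traceLevel` property; the "First row only" checkbox sits in the same Logging panel of the activity. A rough sketch of the relevant part of the activity (names here are placeholders, not from your pipeline):

```json
{
  "name": "Data flow1",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataFlow": {
      "referenceName": "TestDataFlow",
      "type": "DataFlowReference"
    },
    "traceLevel": "None"
  }
}
```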
Here is how it looks...
Upvotes: 9