user989988
user989988

Reputation: 3736

Filter source data in Azure Data Factory (Data Flow) to get most recent values

I have the following data in my source:

enter image description here

I want to filter the data such that output will only include data with most recent date. For example: DEF has 2 values 2.5, 3-Jan and 4, 4-Jan. I want rows with most recent date (4, 4-Jan) so that the output will contain the following result. How do I do that in data flow?

enter image description here

Upvotes: 0

Views: 2360

Answers (1)

Steve Johnson
Steve Johnson

Reputation: 8660

I tried it successfully.Please follow the steps.

First step: create a source(I used csv file with the data you provided),RecentDate column should be like '01-04-2020' rather than '04-Jan' because '04-Jan' can't be sorted. The setting of Projection see the picture.

Second step: create an aggregate, please choose 'Movie' column in the Group By configuration, and the setting of Aggregates please see the image.

Third step: create a new branch(click '+' at the bottom right corner of source)

Fourth step: create a join and setting please see the image.

The last step: create a select and remove two duplicate column('Movie' and 'RecentDate') then output the sink.

Hope these can help you.

Below is all images:

image


Update Answer:

Below is how to output only max rate:

This is my test data: Movie,MaxRate,RecentDate ABC,3,02-01-2020 DEF,2.5,03-02-2020 DEF,4,04-01-2020 DEF,6,03-02-2020 This is total flow image:

total

First step:create a source.The setting of Projection like this

source

Second step: create an aggregate, please choose 'Movie' and 'RecentDate' column in the Group By configuration, and the setting of Aggregates is below:

aggregate

Third step: create an aggregate, please choose 'Movie' column in the Group By configuration, and the setting of Aggregates is below:

aggregate2

Fourth step: create a join and setting please see the image

join

The last step: create a select and remove two duplicate column('Movie' and 'RecentDate') then name 'Rate' as 'MaxRate',finally output the sink.

Below is output: Movie,MaxRate,RecentDate ABC,3,02-01-2020 DEF,6,03-02-2020

If you want to output the min rate,just change the max($$) to min($$) in Second step.

If you want to output the both max and min rate, please do like above(max rate flow) until creatingselectfinish and then New branch do min rate flow until creating select finish,finally join two select and delete duplicate column.

If you have another question,please let me know.

Upvotes: 4

Related Questions