Reputation: 17
I have a table that looks like this:
ID | Supplier Number | Supplier Name | Address | Postcode | City | State | First Name | Last Name |
---|---|---|---|---|---|---|---|---|
1 | 13 | Example.com | Uwanted Data | Unwanted Data | ||||
1 | 15 Example St | 9999 | Brisbane | QLD | ||||
1 | Unwanted Data | Uwanted Data | John | Doe | ||||
2 | 16 | New Example Services | Uwanted Data | Unwanted Data | ||||
2 | 15 Test Drive | 6789 | Melbourne | VIC | ||||
2 | Unwanted Data | Uwanted Data | Jane | Test |
Where "Unwanted Data" refers to values that are not needed in the final dataset. What I am looking to do is to remove the "Unwanted Data" values and then merge the rows that have the same key such that we end up with the following table:
ID | Supplier Number | Supplier Name | Address | Postcode | City | State | First Name | Last Name |
---|---|---|---|---|---|---|---|---|
1 | 13 | Example.com | 15 Example St | 9999 | Brisbane | QLD | John | Doe |
2 | 16 | New Example Services | 15 Test Drive | 6789 | Melbourne | VIC | Jane | Test |
Is there any way to do this within Azure Data Factory? Alternatively, I can connect to an Azure SQL Database to run any SQL commands that me be able to do the work.
Many thanks in advance.
Edit: In some cases, the unwanted data may be of a similar type or value to the required value in the column. For example, in the column "First Name" there may be Joe and John for one particular ID. However, the desired value is in the exact same location relative to every ID. That is, the desired first name is in the third row of each ID.
Upvotes: 0
Views: 190
Reputation: 5074
As @Nick.McDermaid, @Anand Sowmithiran mentioned in the comments, identify the unwanted data and you can replace unwanted data/uwanted data
in each column with blank or NULL and apply max() function to get the aggregated values.
You can write a query directly in the Azure SQL database to get the expected results as below.
select id ID,
max(case when [Supplier Number] = 'Uwanted Data' or [Supplier Number] = 'Unwanted Data' then '' else [Supplier Number] end) [Supplier Number],
max(case when [Supplier Name]= 'Uwanted Data' or [Supplier Name] = 'Unwanted Data' then '' else [Supplier Name] end) [Supplier Name],
max(case when [Address] = 'Uwanted Data' or [Address] = 'Unwanted Data' then '' else [Address] end) [Address],
max(case when [Postcode]= 'Uwanted Data' or [Postcode] = 'Unwanted Data' then '' else [Postcode] end) [Postcode],
max(case when [City]= 'Uwanted Data' or [City] = 'Unwanted Data' then '' else [City] end) [City],
max(case when [State]= 'Uwanted Data' or [State] = 'Unwanted Data' then '' else [State] end) [State],
max(case when [First Name]= 'Uwanted Data' or [First Name] = 'Unwanted Data' then '' else [First Name] end) [First Name],
max(case when [Last Name]= 'Uwanted Data' or [Last Name] = 'Unwanted Data' then '' else [Last Name] end) [Last Name]
from tb1
group by id
This gives your expected results only when there is a single correct value in a column per ID. It gives different results if there are more than 1 correct value in a column per ID value.
Upvotes: 1