Reputation: 11
I have a CSV file which I am trying to process using Azure Data Lake Analytics U-SQL. I am fairly new to U-SQL, so please bear with me. The original file is semi-structured, which I managed to fix using the silent: true flag. Now that it is more structured, I would like to fill the empty cells with the data from the cells above them.
My data looks like this: [screenshot: CSV with empty cells]
My problem lies with the empty cells in the first four columns.
The second row has data which I would like to copy down into the empty cells below it (rows 3-5). Likewise, the data from row 7 needs to be copied down to row 8, the data from row 9 to rows 10-13, and the data from row 14 to rows 15-18.
This has to be done without changing the values in the 'Amount claimed' column.
Does anyone have any ideas on how to achieve this in U-SQL?
Thank you.
Upvotes: 1
Views: 643
Reputation: 3188
An alternative approach (LAST_VALUE doesn't work for me):
If you have some row number or timestamp field, then there is no problem:
@tb1 =
    SELECT * FROM
        (VALUES
            (1, "Noah1"),
            (2, (string) null),
            (3, "Noah3"),
            (5, (string) null),
            (6, (string) null),
            (7, "Noah6"),
            (8, "Noah7")
        ) AS T(Timestamp, a);
// Mark each non-null value that differs from the next row's value;
// with this sample (no consecutive duplicates) every such row starts a fill-down group.
@tb1 =
    SELECT Timestamp,
           [a],
           [a] != null && [a] != LEAD([a], 1) OVER(ORDER BY Timestamp ASC) AS aSwitch
    FROM @tb1;
// A running count of those markers turns them into a group number per row.
@tb1 =
    SELECT Timestamp,
           [a],
           SUM(aSwitch ? 1 : 0) OVER(ORDER BY Timestamp ASC ROWS UNBOUNDED PRECEDING) AS aGrp
    FROM @tb1;
// Within each group the first value is the non-null one; spread it over the whole group.
@tb1 =
    SELECT Timestamp,
           FIRST_VALUE([a]) OVER(PARTITION BY aGrp ORDER BY Timestamp ASC) AS aFilled
    FROM @tb1;

OUTPUT @tb1 TO "/test.csv" USING Outputters.Csv(outputHeader: true);
Result:
"Timestamp","aFilled"
1,"Noah1"
2,"Noah1"
3,"Noah3"
5,"Noah3"
6,"Noah3"
7,"Noah6"
8,"Noah7"
But what to do if you don't have such a field? In simple cases, you could use a dummy field:
@tb1 =
    SELECT * FROM
        (VALUES
            ("Noah1"),
            ((string) null),
            ("Noah3"),
            ((string) null),
            ((string) null),
            ("Noah6"),
            ("Noah7")
        ) AS T(a);
@tb1 =
    SELECT 1 AS Timestamp,
           [a]
    FROM @tb1;
Upvotes: 2
Reputation: 114
U-SQL is generally a language for processing large, order-agnostic data - this problem is not a good fit for it.
Rowsets - the fundamental U-SQL building blocks - are unordered logical data containers. Thus the order of lines in the original input is lost the moment you read it into a rowset; you have to recreate the order within U-SQL using some ordering key.
Assuming there is such an ordering key,
@data =
    SELECT A,
           LAST_VALUE(Col == "" ? null : Col)
               OVER (ORDER BY Key ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col
    FROM @input;
should do it, since LAST_VALUE should ignore nulls. Note: the U-SQL documentation doesn't actually specify whether nulls are ignored - they should be, per general aggregate/windowing function conventions, but this needs to be verified.
Your data doesn't have an ordering column - to create one, you would need a custom extractor that reads the file as a single unit and emits a line number with each row.
This may be too complicated for an amount of data that you could just process locally before uploading.
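For instance, a few lines of C# run locally can do the fill-down before uploading. A minimal sketch, assuming plain comma-separated fields with no quoted commas; claims.csv is a hypothetical file name:

using System.IO;
using System.Linq;

class FillDown
{
    static void Main()
    {
        var last = new string[4];                     // remembered values for the first four columns
        var filled = File.ReadLines("claims.csv")
            .Select(line =>
            {
                var cells = line.Split(',');
                for (int i = 0; i < 4 && i < cells.Length; i++)
                {
                    if (string.IsNullOrEmpty(cells[i]))
                        cells[i] = last[i];           // empty cell: copy down from the row above
                    else
                        last[i] = cells[i];           // non-empty cell: remember it
                }
                return string.Join(",", cells);
            })
            .ToList();
        File.WriteAllLines("claims_filled.csv", filled);
    }
}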
Upvotes: 3
Reputation: 749
The LAG analytic function provides access to a row at a given physical offset that comes before the current row. Use this analytic function in a SELECT expression to compare values in the current row with values in a previous row.
https://msdn.microsoft.com/en-us/library/azure/mt791650.aspx
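Note that LAG with an offset of 1 only reaches back a single row, so on its own it fills one empty cell directly below a value; longer runs of empty cells need repeated passes or the grouping approach shown above. A minimal sketch, assuming a rowset @input with an ordering column Key and a text column Col (the names are illustrative, not from the question):

@withPrev =
    SELECT [Key],
           [Col],
           LAG([Col], 1) OVER(ORDER BY [Key] ASC) AS PrevCol
    FROM @input;

@filled =
    SELECT [Key],
           string.IsNullOrEmpty([Col]) ? PrevCol : [Col] AS ColFilled
    FROM @withPrev;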
Upvotes: 2
Reputation: 858
I think you can solve this by using U-SQL user-defined operators (UDOs). In a UDO you can iterate row by row, and whenever you get a row with empty values, copy the data down from the previous row. A rough sketch follows.
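A minimal sketch of such a processor, assuming just two columns, Name and AmountClaimed (the names are illustrative); note that, as mentioned in another answer, U-SQL does not guarantee the order in which rows reach a processor, so this only behaves as intended if the rows arrive in file order:

// C# code-behind
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedProcessor]
public class FillDownProcessor : IProcessor
{
    private string lastName;   // last non-empty value seen in the Name column

    public override IRow Process(IRow input, IUpdatableRow output)
    {
        var name = input.Get<string>("Name");
        if (!string.IsNullOrEmpty(name))
            lastName = name;                      // non-empty cell: remember it
        output.Set<string>("Name", lastName);     // empty cell: reuse the remembered value
        output.Set<decimal>("AmountClaimed", input.Get<decimal>("AmountClaimed"));
        return output.AsReadOnly();
    }
}

And in the U-SQL script:

@filled =
    PROCESS @input
    PRODUCE Name string,
            AmountClaimed decimal
    USING new FillDownProcessor();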
Upvotes: 0