Reputation: 23
I have a BUNCH of fixed width text files that contain multiple transaction types with only 3 that I care about (121,122,124).
Sample File:
D103421612100188300000300000000012N000002000001000032021420170012260214201700122600000000059500000300001025798 D103421612200188300000300000000011000000000010000012053700028200004017000000010240000010000011NNYNY000001000003N0000000000 00 D1034216124001883000003000000000110000000000300000100000000000CS00000100000001200000033NN0 00000001200
So What I need to do is read line by line from these files and look for the ones that have a 121, 122, or 124 at startIndex = 9 and length = 3.
Each line needs to be parsed based on a data dictionary I have and the output needs to be grouped by transaction type into three different files.
I have a process that works but it's very inefficient, basically reading each line 3 times. The code I have is something like this:
@121 = EXTRACT
col1 string,
col2 string,
col3 string //ect...
FROM inputFile
USING new MyCustomExtractor(
new SQL.MAP<string, string> {
{"col1","2"},
{"col2","6"},
{"col3","3"} //ect...
};
);
OUTPUT @121
TO 121.csv
USING Outputters.Csv();
And I have the same code for 122 and 124. My custom extractor takes the SQL MAP and returns the parsed line and skips all lines that don't contain the transaction type I'm looking for.
This approach also means I'm running through all the lines in a file 3 times. Obviously this isn't as efficient as it could be.
What I'm looking for is a high level concept of the most efficient way to read a line, determine if it is a transaction I care about, then output to the correct file.
Thanks in advance.
Upvotes: 1
Views: 71
Reputation: 121
As of today, there is no specific U-SQL function that can define the output location of a tuple on the fly.
wBob presented an approach to a potential workaround. I'd extend the solution the following way to address your need:
If you have more feedback or needs, feel free to create an item (and voting for others) on our UserVoice site: https://feedback.azure.com/forums/327234-data-lake. Thanks!
Upvotes: 1
Reputation: 14389
How about pulling out the transaction type early using the Substring
method of the String datatype? Then you can do some work with it, filtering etc. A simple example:
// Test data
@input = SELECT *
FROM (
VALUES
( "D103421612100188300000300000000012N000002000001000032021420170012260214201700122600000000059500000300001025798" ),
( "D103421612200188300000300000000011000000000010000012053700028200004017000000010240000010000011NNYNY000001000003N0000000000 00" ),
( "D1034216124001883000003000000000110000000000300000100000000000CS00000100000001200000033NN0 00000001200" ),
( "D1034216999 0000000000000000000000000000000000000000000000000000000000000000000000000000000 00000000000" )
) AS x ( rawData );
// Pull out the transaction type
@working =
SELECT rawData.Substring(8,3) AS transactionType,
rawData
FROM @input;
// !!TODO do some other work here
@output =
SELECT *
FROM @working
WHERE transactionType IN ("121", "122", "124"); //NB Note the case-sensitive IN clause
OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
Upvotes: 1