Reputation: 45
I have a transaction table where I am trying to map outflows to inflows based on LIFO (Last In, First Out).
Input dataset
Id | Date | Type | Amount |
---|---|---|---|
1 | 26-01-2024 | Inflow | 519 |
2 | 26-01-2024 | Outflow | 100 |
3 | 26-01-2024 | Outflow | 139 |
4 | 26-01-2024 | Outflow | 122 |
5 | 29-01-2024 | Outflow | 42 |
6 | 29-01-2024 | Inflow | 713 |
7 | 29-01-2024 | Inflow | 887 |
8 | 29-01-2024 | Outflow | 92 |
9 | 29-01-2024 | Outflow | 1593 |
10 | 29-01-2024 | Outflow | 25 |
Desired output (mapped dataset)
Inflow_Id | Inflow_Date | Outflow_Id | Outflow_Date | inflow_amount | outflow_amount | map_amount | inflow_left | outflow_left |
---|---|---|---|---|---|---|---|---|
1 | 26-01-2024 | 2 | 26-01-2024 | 519 | 100 | 100 | 419 | 0 |
1 | 26-01-2024 | 3 | 26-01-2024 | 519 | 139 | 139 | 280 | 0 |
1 | 26-01-2024 | 4 | 26-01-2024 | 519 | 122 | 122 | 158 | 0 |
1 | 26-01-2024 | 5 | 29-01-2024 | 519 | 42 | 42 | 116 | 0 |
7 | 29-01-2024 | 8 | 29-01-2024 | 887 | 92 | 92 | 795 | 0 |
7 | 29-01-2024 | 9 | 29-01-2024 | 887 | 1593 | 795 | 0 | 798 |
6 | 29-01-2024 | 9 | 29-01-2024 | 713 | 1593 | 713 | 0 | 85 |
1 | 26-01-2024 | 9 | 29-01-2024 | 519 | 1593 | 85 | 31 | 0 |
1 | 26-01-2024 | 10 | 29-01-2024 | 519 | 25 | 25 | 6 | 0 |
If you look at the last four lines of the output dataset, outflows 9 and 10 are mapped to inflow 1 because inflows 6 and 7 were exhausted first.
Between 6 and 7, which are both inflows, the subsequent outflows are mapped to 7 first (last in, so first out) before 6.
This is a practical asset allocation problem. Imagine bank account transactions. A customer deposits cash and then wire transfers that amount to family. It seems logical that the wire transfer is tagged to the amount deposited just prior to the transfer (Last In, First Out), instead of being allocated to the existing balance from a previous inflow (First In, First Out). Hence, LIFO is better for this use case than FIFO.
1) The above output format is essential. It tracks which outflow is tagged to which inflow, and vice versa.
2) There could be multiple outflows (debits) paid from one inflow (credit).
3) There could be multiple inflows (credits) used to fund one outflow (debit).
4) Hence, the map amount helps us understand how much of an outflow and inflow was matched, and what was left over to be mapped to a future transaction.
5) Sum(outflows) is not greater than sum(inflows): money that is not in the account cannot be spent. Although the account going into overdraft (a negative balance) is technically possible, let's ignore that scenario, or just not assign that outflow.
6) The first transaction is usually an inflow. There needs to be an inflow before anything can be spent.
Logic to be implemented (see the Python sketch below):
For every inflow, the outflows following that inflow should be mapped first.
If the outflows after an inflow exceed the inflow amount, the subsequent outflows should be attributed to the previous inflow.
If an inflow has not been fully attributed, it needs to show up as an inflow without an outflow mapping (it could be mapped later as new outflows are added to the table).
FIFO logic from the link below, using a running total and a previous total, seems straightforward - Row pairing according to FIFO logic in SQL Server 2018
But LIFO seems more complicated, because it feels like some kind of recursion is needed: running totals need to be reset after every inflow, and outflows need to be assigned to previous inflows. I would prefer a solution without recursion, but I don't know if that is possible.
I found a SAS implementation that does exactly what I am looking for, and am looking to port it to SQL/Python - https://communities.sas.com/t5/SAS-Programming/How-to-create-a-Last-in-First-out-LIFO-Algorithm-Using-Hash/td-p/413730
I am trying to implement this on GCP (BigQuery) for a dataset that contains hundreds of transactions per customer, for about 4 million customers.
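To make the rules above concrete, here is a rough Python sketch of the behaviour I am after (purely illustrative, not a solution at this scale; variable names are mine): keep a stack of inflows that still have money left, and let each outflow consume from the most recent open inflow first.
# A stack holds inflows that still have an unmapped balance; each
# outflow consumes from the most recent open inflow first (LIFO).
rows = [
    (1, '26-01-2024', 'Inflow', 519),
    (2, '26-01-2024', 'Outflow', 100),
    (3, '26-01-2024', 'Outflow', 139),
    (4, '26-01-2024', 'Outflow', 122),
    (5, '29-01-2024', 'Outflow', 42),
    (6, '29-01-2024', 'Inflow', 713),
    (7, '29-01-2024', 'Inflow', 887),
    (8, '29-01-2024', 'Outflow', 92),
    (9, '29-01-2024', 'Outflow', 1593),
    (10, '29-01-2024', 'Outflow', 25),
]

stack = []     # open inflows: [id, date, original_amount, amount_left]
mapping = []   # output rows, in the desired dataset's column order

for tid, tdate, ttype, amount in rows:
    if ttype == 'Inflow':
        stack.append([tid, tdate, amount, amount])
        continue
    left = amount  # outflow amount still to be mapped
    # If the stack empties, the outflow stays unassigned (rule 5).
    while left > 0 and stack:
        inflow = stack[-1]  # most recent open inflow
        take = min(left, inflow[3])
        inflow[3] -= take
        left -= take
        mapping.append((inflow[0], inflow[1], tid, tdate, inflow[2],
                        amount, take, inflow[3], left))
        if inflow[3] == 0:
            stack.pop()     # inflow exhausted, fall back to the previous one

for row in mapping:  # reproduces the desired output table above
    print(row)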
Upvotes: 1
Views: 406
Reputation: 56
It seems that there is no nice set-based approach to this. As others have pointed out, it will likely have to be iterative.
The bad news is that you will have to loop through every record once. The good news is that (depending on what information you actually want) you don't need to do an insert for every record.
I've written a sample solution to this problem. Please forgive the verbosity; it is written in T-SQL (MSSQL). I did not keep track of the extra columns you have in your sample answer; instead I focused just on solving for the label of the equivalence class.
After all, that is essentially what we are trying to do here: bundle records that are in some way 'equivalent' into mutually exclusive partitions of the underlying data set. What do we call each of these mutually exclusive partitions? That is what I refer to as the label of the equivalence class. In this case it is the Inflow Id that the Outflow records correspond to.
--Create the input data set
DROP TABLE IF EXISTS #InputData;
CREATE TABLE #InputData(Id INTEGER NOT NULL
, Date DATE NOT NULL
, Type VARCHAR(7) NOT NULL
, Amount BIGINT NOT NULL
);
INSERT INTO #InputData
SELECT X.Id
, CONVERT(DATE, X.Date, 105) --Converting from VARCHAR to DATE
, X.Type
, X.Amount
FROM (
VALUES(1, '26-01-2024', 'Inflow', 519)
, (2, '26-01-2024', 'Outflow', 100)
, (3, '26-01-2024', 'Outflow', 139)
, (4, '26-01-2024', 'Outflow', 122)
, (5, '29-01-2024', 'Outflow', 42)
, (6, '29-01-2024', 'Inflow', 713)
, (7, '29-01-2024', 'Inflow', 887)
, (8, '29-01-2024', 'Outflow', 92)
, (9, '29-01-2024', 'Outflow', 1593)
, (10, '29-01-2024', 'Outflow', 25)
) X(Id, Date, Type, Amount);
SELECT * FROM #InputData;
--Need to iterate through this from the last inflow to the first.
--Declare variables used throughout
DECLARE @max_outflow_id BIGINT;
DECLARE @current_inflow_id BIGINT;
DECLARE @current_inflow_amount BIGINT;
DECLARE @smallest_processed_outflow_id BIGINT;
DECLARE @current_inflow_count BIGINT;
DECLARE @running_outflow_total BIGINT;
DECLARE @excess_amount BIGINT;
DECLARE @upper_outflow_id BIGINT;
DECLARE @lower_outflow_id BIGINT;
DECLARE @final_outflow_id BIGINT;
DECLARE @next_insert_lower_outflow_id BIGINT;
DECLARE @amount BIGINT;
DECLARE @id BIGINT;
--Create tables used throughout
DROP TABLE IF EXISTS #Inflows;
CREATE TABLE #Inflows(id BIGINT NOT NULL
, amount BIGINT NOT NULL
, iteration_order BIGINT NOT NULL
);
--This table will store the results
DROP TABLE IF EXISTS #InflowOutflowMapping;
CREATE TABLE #InflowOutflowMapping (inflow_id BIGINT NOT NULL
, lower_outflow_id BIGINT NOT NULL
, upper_outflow_id BIGINT NOT NULL
);
--Get the maximum id in the table associated with an Outflow
SELECT @max_outflow_id = MAX(id)
FROM #InputData
WHERE type = 'Outflow';
--Keep a list of the inflow Ids from last to first, as well as their amounts.
INSERT INTO #Inflows(id
, amount
, iteration_order
)
SELECT id
, amount
, ROW_NUMBER() OVER(ORDER BY id DESC) AS iteration_order
FROM #InputData
WHERE type = 'Inflow'
ORDER BY iteration_order ASC;
--Get the first inflow id to be processed
SELECT @current_inflow_id = id
, @current_inflow_amount = Amount
FROM #Inflows
WHERE iteration_order = 1;
SET @current_inflow_count = 1;
--Get the lower and upper bound for the outflow ids which will be considered first
SELECT @lower_outflow_id = id + 1 --This only works due to careful construction
FROM #Inflows
WHERE iteration_order = 1;
SET @upper_outflow_id = @max_outflow_id;
--Get the outflow id that corresponds to the inflow id that will be processed last.
SELECT @final_outflow_id = MAX(id)
FROM #InputData
WHERE id < (SELECT Id
FROM #Inflows
WHERE iteration_order = (SELECT MAX(iteration_order) - 1
FROM #Inflows)
)
AND type = 'Outflow';
--Cater for the edge case where there is only one inflow
SELECT @final_outflow_id = COALESCE(@final_outflow_id, MAX(id))
FROM #InputData
WHERE type = 'Outflow';
--Keep a tally of the running total of outflows that offset against an inflow
SET @running_outflow_total = 0;
WHILE @upper_outflow_id IS NOT NULL
BEGIN
--Build a cursor that will be used like a FOR loop
DECLARE outflow_window CURSOR FOR
SELECT id
, amount
FROM #InputData
WHERE id BETWEEN @lower_outflow_id AND @upper_outflow_id
AND type = 'Outflow'
ORDER BY id ASC;
--Activate the cursor; just some T-SQL syntax.
OPEN outflow_window;
--Take the first record from the cursor, and assign local variables from it.
FETCH NEXT FROM outflow_window
INTO @id
, @amount;
--Need to keep track of the lower bound outflow_id for the next insert operation.
SET @next_insert_lower_outflow_id = @lower_outflow_id;
WHILE @@FETCH_STATUS = 0 --Think of this as a FOR loop over the outflow_window cursor
BEGIN
--Have we exceeded the current inflow amount with our tallied outflows?
IF @running_outflow_total + @amount >= @current_inflow_amount
BEGIN
--Inflow has been consumed, store the result
INSERT INTO #InflowOutflowMapping (inflow_id
, lower_outflow_id
, upper_outflow_id
)
VALUES(@current_inflow_id
, @next_insert_lower_outflow_id
, @id
);
SET @next_insert_lower_outflow_id = @id; --The boundary outflow may spill into the next inflow, so it also starts the next mapping's range
--By how much have we exceeded it?
SET @excess_amount = @running_outflow_total + @amount - @current_inflow_amount;
--Reset running total of outflow amount
SET @running_outflow_total = 0;
--Let's get ready to move onto the next inflow
SET @current_inflow_count = @current_inflow_count + 1;
--The excess amount may contribute to multiple inflows (if it exceeds each inflows amount)
WHILE @excess_amount > 0
BEGIN
--Get the next inflow id, and the amount associated with it
SELECT @current_inflow_id = id
, @current_inflow_amount = amount
FROM #Inflows
WHERE iteration_order = @current_inflow_count;
--if the excess exceeds the current_inflow_amount, then there is some new excess leftover.
IF @excess_amount >= @current_inflow_amount
BEGIN
--Inflow has been consumed (by a single outflow), store the result
INSERT INTO #InflowOutflowMapping (inflow_id
, lower_outflow_id
, upper_outflow_id
)
VALUES(@current_inflow_id
, @next_insert_lower_outflow_id
, @next_insert_lower_outflow_id
);
--Calculate the new excess.
SET @excess_amount = @excess_amount - @current_inflow_amount;
--Let's get ready to move onto the next inflow
SET @current_inflow_count = @current_inflow_count + 1;
END;
ELSE --Otherwise, the excess is smaller than the @current_inflow_amount
BEGIN
--Set the outflow total to the excess
SET @running_outflow_total = @excess_amount;
--Excess is now back to 0. forces an exit on the inner while loop
SET @excess_amount = 0;
END;
END;
END;
ELSE -- inflow amount is not exceeded
BEGIN
--Increment the running total of outflow amounts consumed.
SET @running_outflow_total = @running_outflow_total + @amount;
END;
--Assigning values for the next loop iteration
FETCH NEXT FROM outflow_window
INTO @id
, @amount
END;
--End of the inner most for loop, need to attribute any remaining outflows to the current inflow
INSERT INTO #InflowOutflowMapping (inflow_id
, lower_outflow_id
, upper_outflow_id
)
VALUES(@current_inflow_id
, @next_insert_lower_outflow_id
, @upper_outflow_id
);
--Just some t-sql syntax for closing and removing the cursor. (so we can declare it again with no issues)
CLOSE outflow_window;
DEALLOCATE outflow_window;
--We've reached the end of the window of outflows we just looped through.
--Need to now look at all outflows (that we haven't already seen) since the current inflow
SELECT @upper_outflow_id = MAX(id)
FROM #InputData
WHERE id < @lower_outflow_id --When this returns nothing, @upper_outflow_id will be NULL and the outer-most WHILE will terminate
AND type = 'Outflow';
SELECT @lower_outflow_id = MIN(id)
FROM #InputData
WHERE id > @current_inflow_id
AND type = 'Outflow';
END;
SELECT *
FROM #InflowOutflowMapping;
SELECT ind.*
, iom.inflow_id
FROM #InputData ind
LEFT JOIN #InflowOutflowMapping iom
ON ind.id BETWEEN iom.lower_outflow_id AND iom.upper_outflow_id
AND ind.type = 'Outflow'
ORDER BY ind.id ASC
;
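Traced by hand against the sample data, #InflowOutflowMapping should come out as:
inflow_id | lower_outflow_id | upper_outflow_id |
---|---|---|
7 | 8 | 9 |
6 | 9 | 9 |
1 | 9 | 10 |
1 | 2 | 5 |
which matches the groupings in the desired output: outflow 9 is a boundary record shared by inflows 7, 6 and 1.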
If you want to try it yourself, dbfiddle link here
Now, if you insist on keeping track of the other information (inflow_left, outflow_left, etc.) so that you can see how much each outflow contributed to an inflow, I see no way around it: you'd have to do an insert for each record. Regardless, I think the iterative solution will look similar to my example. That is to say, there will be an outer-most WHILE loop, with a FOR loop inside, and a WHILE loop inside that. You could probably write it recursively, but I'm not sure it would be any easier to understand than this iterative mess...
Assuming you manage to use this approach to load your billions of records, subsequent loads should be a lot less of a burden. You need to keep track of the earliest inflow you got up to (say X) and the amount left over for that inflow, as well as the latest inflow you processed fully (say Y). Then your delta will be everything later than Y, everything earlier than X, and X itself, taking special care to use the correct leftover amount.
Upvotes: 0
Reputation: 3528
LIFO works set-based for discrete (unit) values:
WITH raw as (SELECT "u" as user, id+1 as id, x from unnest([1,1,1,1,1,-1,-1,1,-1,-1,1,1,-1,-1,-1]) as x with offset id),
tbl1 as (
SELECT * ,
sum(x) over win1 as sums
from raw
window win1 as (partition by user order by id rows between unbounded preceding and current row)
)
SELECT *,
if(x<0,last_value(id) over win1,null) as take_last_id
from tbl1
window win1 as (partition by user,(if(x<0,sums+1,sums)) order by id rows between unbounded preceding and 1 preceding)
order by 1,2
So the best way would be to unnest the dataset into one row per unit of value, as sketched below.
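Purely as an illustration of that expansion (in BigQuery itself the equivalent would be something like a cross join unnest(generate_array(...)) per row), in Python:
# Illustrative only: explode each signed amount into +1/-1 unit rows,
# which is what "unnest the dataset by each value" means here.
amounts = [519, -100, -139, -122, -42]
unit_rows = [
    (row_id + 1, 1 if amount > 0 else -1)
    for row_id, amount in enumerate(amounts)
    for _ in range(abs(amount))
]
print(len(unit_rows))  # 922 unit rows for these five transactions
print(unit_rows[:3])   # [(1, 1), (1, 1), (1, 1)]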
For continuous values, the partition by only works if the value is rounded down to an already-present entry. This is done in the subselect.
WITH raw as (SELECT "u" as user, id+1 as id, x from
unnest([519,-100,-139,-122,-42,713,887,-92,-1593,-25])
#unnest([519,-100,-139,-122,-42,713,887,-92,-500,-500,-500,-25,5000,-200])
as x with offset id),
tbl1 as (
SELECT * ,
sum(x) over win1 as sums
from raw
window win1 as (partition by user order by id rows between unbounded preceding and current row)
), tbl2 as (
SELECT *,
array_agg(struct(sums as sumsv,x as xv)) over win1 as flow,
from tbl1
window win1 as (partition by user order by id rows between unbounded preceding and current row)
),
tbl3 as (
SELECT * ,
((
SELECT any_value(y.sumsv having min out_ok) from(
SELECT y,if(y.sumsv>=sums-x and id>in1+1 and y.xv>0,in1,null) as out_ok from unnest(flow) as y with offset in1
)
)) as sum_find_prev,
from tbl2
),tbl4 as (
SELECT *,
if(x<0,last_value(if(x>0,id,null) ignore nulls) over win2,null) as take_last_id,
from tbl3
window win2 as (partition by user,(if(x<0,sum_find_prev,sums)) order by id rows between unbounded preceding and 1 preceding)
)
SELECT * except(flow),
from tbl4
order by 1,2
Upvotes: 0