Reputation: 2473
I have a big table of process data in a Hadoop data lake, from a production line where a product goes through many steps during its creation. At certain points I am interested in finding out what happened N steps before a certain process took place.
I can write a select query which returns the rows corresponding to the process I can identify, but how can I (and should I) return the N rows prior to the one which meets my actual select query?
I want to know whether such a thing is possible in SQL/PySpark and whether I should do it that way, or whether it is better to select a much bigger set of data which contains the information I want and then filter out just the bits I need in Python.
Example table structure:
| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| A    | 1    | One   | Date 1 |
| C    | 1    | Two   | Date 2 |
| B    | 1    | One   | Date 1 |
| C    | 2    | Two   | Date 2 |
| C    | 1    | Three | Date 3 |
| D    | 2    | Four  | Date 1 |
| E    | 1    | Five  | Date 5 |
SELECT * FROM Table1
WHERE Col1 = 'C' AND Col3 = 'Three'
ORDER BY Col4, Col2
What would be returned:

| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| C    | 1    | Three | Date 3 |
What I would like to be returned:

| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| C    | 1    | Two   | Date 2 |
| C    | 2    | Two   | Date 2 |
| C    | 1    | Three | Date 3 |
Above, N = 2, but N should be variable. If an implementation is possible in SQL, I am also really interested in whether this is something that should be done in SQL or whether it is better done in code afterwards. I can see arguments for both sides but need some outside opinions.
Edit: Both of the suggested approaches relied on the structure I provided previously, where Col2 was an incrementally increasing value. This was misleading, as I used Col2 and the numbers as dummy values. I have now updated the table with two columns which show how the table is actually ordered: first by a column which contains a datetime stamp, and second by a column which contains an integer number.
The table lives in a Hadoop data lake, so suggested solutions should provide SQL which can execute in that environment.
Edit 2: It has also become apparent that the rows themselves may not necessarily be consecutive, so I don't want simply the last N rows, but rather the last N rows that also meet a certain predicate. In the example above the predicate would be WHERE Col1 = 'C'.
Upvotes: 0
Views: 1092
Reputation: 1269753
You can use a running count:
SELECT *
FROM (SELECT t1.*,
             SUM(CASE WHEN Col1 = 'C' AND Col3 = 'Three' THEN 1 ELSE 0 END)
                 OVER (ORDER BY Col4, Col2 ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt_special
      FROM Table1 t1
      WHERE Col1 = 'C'   -- restrict to rows meeting the predicate (see Edit 2)
     ) t
WHERE cnt_special > 0;
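If you would rather stay in the DataFrame API than raw SQL, the same running-count idea can be sketched in PySpark (an untested sketch; the DataFrame name df, the column names, and the variable n are assumptions based on the question):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    n = 2  # how many prior rows to keep

    # Flag any row that is the target row or has it among the next n rows.
    # Note: a window with no partitionBy pulls all rows into one partition.
    w = Window.orderBy("Col4", "Col2").rowsBetween(Window.currentRow, n)

    result = (df.filter(F.col("Col1") == "C")   # predicate from Edit 2
                .withColumn("cnt_special",
                            F.sum(F.when((F.col("Col1") == "C") &
                                         (F.col("Col3") == "Three"), 1)
                                   .otherwise(0)).over(w))
                .filter(F.col("cnt_special") > 0)
                .drop("cnt_special"))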
Or, you can use window functions to identify the col2 value. This works if you want the offset to be logical (i.e. based on the values in col2) rather than physical (i.e. the number of rows):
SELECT *
FROM (SELECT t1.*,
             MAX(CASE WHEN Col1 = 'C' AND Col3 = 'Three' THEN Col2 END)
                 OVER (ORDER BY Col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS special_col2
      FROM Table1 t1
     ) t
WHERE Col2 <= special_col2 AND Col2 >= special_col2 - 2;
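If Col2 really were an increasing value, you could also collect the special value first and skip the window entirely; a minimal aggregate-then-filter sketch of that variant, reusing df, F and n from the snippet above:

    # Grab the Col2 value of the special row (MAX keeps the latest if there
    # are several matches; special is None if no row matches).
    special = (df.filter((F.col("Col1") == "C") & (F.col("Col3") == "Three"))
                 .agg(F.max("Col2"))
                 .first()[0])

    result = df.filter((F.col("Col2") >= special - n) &
                       (F.col("Col2") <= special))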
Upvotes: 1
Reputation: 7960
You can use the row_number() function like below:
Declare @n int = 3

SELECT *
FROM (SELECT Col1, Col2, Col3,
             Row_Number() OVER (ORDER BY Col2 ASC) AS RowOrder
      FROM Table1
     ) x
WHERE x.RowOrder <= @n
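Declare is T-SQL, so this won't run as-is in Hive or Spark SQL. One way to keep N variable in a Hadoop environment (a sketch, assuming an active SparkSession named spark) is to substitute it from Python when building the query:

    n = 3  # number of rows to return

    # Same idea as above; the ORDER BY should reflect the table's real
    # ordering (the datetime column first, then the integer column).
    query = f"""
        SELECT Col1, Col2, Col3, Col4
        FROM (SELECT t.*,
                     row_number() OVER (ORDER BY Col4, Col2) AS row_order
              FROM Table1 t) x
        WHERE x.row_order <= {n}
    """
    result = spark.sql(query)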
Upvotes: 1