Aesir

Reputation: 2473

Select N rows prior to a row that matches a select condition

I have a big table of process data from a production line in a Hadoop data lake; it records the many steps a product goes through during its creation. At certain points I am interested in finding out what happened N steps before a certain process took place.

I can write a select query which returns the rows corresponding to the process I want, but how can I (and should I) also return the N rows prior to the one which matches my actual select condition?

I want to know whether such a thing is possible in SQL/PySpark, and whether it should be done there at all, or whether it is better to select a much bigger set of data containing the information I want and then filter out just the bits I need in Python afterwards.

Example table structure:

| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| A    | 1    | One   | Date 1 |
| C    | 1    | Two   | Date 2 |
| B    | 1    | One   | Date 1 |
| C    | 2    | Two   | Date 2 |
| C    | 1    | Three | Date 3 |
| D    | 2    | Four  | Date 1 |
| E    | 1    | Five  | Date 5 |

SELECT * FROM Table1
WHERE Col1 = 'C' AND Col3 = 'Three'
ORDER BY Col4, Col2

What would be returned

| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| C    | 1    | Three | Date 3 |

What I would like to be returned

| Col1 | Col2 | Col3  | Col4   |
|------|------|-------|--------|
| C    | 1    | Two   | Date 2 |
| C    | 2    | Two   | Date 2 |
| C    | 1    | Three | Date 3 |

Above, N = 2, but N should be variable. If an implementation is possible in SQL, I am really interested in whether this is something that should be done in SQL or whether it is better to do it in code afterwards. I can see arguments for both sides but need some outside opinions.

Edit: Both of the suggested approaches seemed to rely on the structure I provided previously, where Col2 was an incrementally increasing value. This was misleading, as I used Col2 and the numbers as dummy values. I have now updated the table with two columns which show how the table is actually ordered: first by a column which contains a datetime stamp (Col4) and second by a column which contains an integer (Col2).

The table lives in a Hadoop data lake, so suggested solutions should provide SQL which can execute in that environment.

Edit 2: It has also become apparent that the rows themselves may not necessarily be consecutive, so I don't want simply the last N rows but rather the last N rows that also meet a certain predicate. In the example above the predicate would be WHERE Col1 = 'C'.
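To make that concrete, here is a sketch of the kind of query I imagine (untested, and I am not sure Hive accepts it; N is hard-coded to 2, and the table and column names are from my example above). The idea is to apply the predicate first and only then look backwards over the filtered, correctly ordered rows:

SELECT *
FROM (SELECT t1.*,
             SUM(CASE WHEN Col3 = 'Three' THEN 1 ELSE 0 END)
                 OVER (ORDER BY Col4, Col2
                       ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS near_match
      FROM Table1 t1
      WHERE Col1 = 'C') filtered  -- predicate applied before the window is evaluated
WHERE near_match > 0;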

Upvotes: 0

Views: 1092

Answers (2)

Gordon Linoff

Reputation: 1269753

You can use a running count:

SELECT *
FROM (SELECT t1.*,
             SUM(CASE WHEN Col1 = 'C' AND Col3 = 'Three' THEN 1 ELSE 0 END)
                 OVER (ORDER BY Col4, Col2
                       ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt_special
      FROM Table1 t1
     ) t1
WHERE cnt_special > 0;

Each row counts the matches among itself and the next two rows, so a row is kept exactly when it is the matching row or one of the N = 2 rows immediately before it.
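Since N should be variable: if you run the query through the Hive CLI or Beeline, you can pass N in as a substitution variable (a sketch, assuming that style of invocation):

-- run as, e.g.: hive --hivevar n=2 -f query.sql
SELECT *
FROM (SELECT t1.*,
             SUM(CASE WHEN Col1 = 'C' AND Col3 = 'Three' THEN 1 ELSE 0 END)
                 OVER (ORDER BY Col4, Col2
                       ROWS BETWEEN CURRENT ROW AND ${hivevar:n} FOLLOWING) AS cnt_special
      FROM Table1 t1
     ) t1
WHERE cnt_special > 0;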

Or, you can use window functions to identify the Col2 value of the matching row. This works if you want the offset to be logical (i.e. based on the values in Col2) rather than physical (i.e. the number of rows):

SELECT *
FROM (SELECT t1.*,
             MAX(CASE WHEN Col1 = 'C' AND Col3 = 'Three' THEN Col2 END)
                 OVER (ORDER BY Col2) AS special_col2
      FROM Table1 t1
     ) t1
WHERE Col2 <= special_col2 AND Col2 >= special_col2 - 2;

Upvotes: 1

Eray Balkanli

Reputation: 7960

You can use the row_number() function like below:

Declare @n int = 3  -- T-SQL syntax for the variable; in Hive you would substitute N another way

SELECT *
FROM (
   SELECT Col1, Col2, Col3, Col4,
          ROW_NUMBER() OVER (ORDER BY Col4, Col2) AS RowOrder
   FROM Table1
) x
WHERE x.RowOrder <= @n
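As written this returns the first @n rows of the whole table; if you need the window anchored to the matching row instead, one way to extend it (a sketch, untested; the cross join assumes the match is unique) is:

SELECT y.Col1, y.Col2, y.Col3, y.Col4
FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY Col4, Col2) AS rn
      FROM Table1 t) y
CROSS JOIN (SELECT rn AS match_rn
            FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY Col4, Col2) AS rn
                  FROM Table1 t) m
            WHERE Col1 = 'C' AND Col3 = 'Three') mm
WHERE y.rn BETWEEN mm.match_rn - 2 AND mm.match_rn;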

Upvotes: 1
