ghw29
ghw29

Reputation: 27

Select Rows with unique column value in SQL

I've done a lot of searching and I feel like this should be an easy solution but I can't get anything to work.

I have a database table that holds a nested array. After I unnest the array and select just the columns I want, I end up with a table that looks like this:

SELECT time_period, name, value
FROM 
    TBL.TBLValues,
    UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != "None"
ORDER BY time_period;
row time_period name value
1 2021-07-01T00:00:00 Name1 100
2 2021-07-01T00:00:00 Name2 105
3 2021-07-01T00:05:00 Name1 120
4 2021-07-01T00:10:00 Name3 500
5 2021-07-01T00:15:00 Name1 110
6 2021-07-01T00:15:00 Name3 450
7 2021-07-01T00:20:00 Name1 1000

What I want to do is filter my query (probably through a nested query?) so I only get rows where time_period is unique. In the table above, I would only return rows 3 and 4, as all other rows have multiple names for the same time_period.

I tried SELECT DISTINCT(period) and that is filtering it down substantially as far as the number of returned rows is concerned, but it is definitely returning rows that have multiple names for the same time_period. I'm not sure why, my understanding of that function is that it should only return rows where time_period exists exactly once.

SELECT DISTINCT(time_period)
        FROM
        TBL.TBLValues,
        UNNEST(nested_name) as unnested_name
        WHERE time_period > '2021-07-01 00:00:00'
        AND name != 'None'
    ORDER BY period;

I also tried a COUNT(time_period) AS counter and then a HAVING counter = 1 at the end of my query. This came closest to what I want I think, it's returning very few results, I think the GROUP BY may have been doing something weird? It only gives me one time_period per name, but some of those time_periods are duplicates. Also ideally the next step in this filter is to take every unique time_period, then filter to the most recent time_period for each unique Name, so it would be nice to get this to initially return every unique time_period so I could do that next.

SELECT COUNT(time_period) as counter, time_period, name, value
FROM(
    SELECT time_period, name, value
    FROM  TBL.TBLValues,
            UNNEST(nested_name) as unnested_name
            WHERE time_period > '2021-07-01 00:00:00'
            AND name != 'None')
GROUP BY name, value, time_period
HAVING counter = 1
ORDER BY time_period;

I also tried reworking the solution from this question with the PARTITION BY but wasn't able to get it to filter my results at all.

SELECT time_period, name, value
FROM(
    SELECT time_period, name, value, ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY value ASC) AS row_num
    FROM  TBL.TBLValues,
            UNNEST(nested_name) as unested_name
            WHERE time_period > '2021-07-01 00:00:00'
            AND name != 'None') as all_duplicated
WHERE row_num = 1
ORDER BY time_period;

Upvotes: 0

Views: 1522

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269493

The simplest method is probably qualify

SELECT time_period, name, value
FROM TBL.TBLValues t CROSS JOIN
     UNNEST(t.nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00' AND
      name <> 'None'
QUALIFY COUNT(*) OVER (PARTITION BY time_period) = 1
ORDER BY time_period;

Upvotes: 0

Jon Armstrong
Jon Armstrong

Reputation: 4694

There are several ways to do this. Here's one (with standard SQL):

WITH xrows AS (
       SELECT tbl.*
            , COUNT(*) OVER (PARTITION BY time_period) AS n
         FROM tbl
     )
SELECT *
  FROM xrows
 WHERE n = 1
 ORDER BY time_period
;

and with your SQL as a starting point:

WITH your_sql AS (
        SELECT time_period, name, value
             , COUNT(*) OVER (PARTITION BY time_period) AS n
          FROM TBL.TBLValues,
            UNNEST(nested_name) as unnested_name
         WHERE time_period > '2021-07-01 00:00:00'
           AND name != 'None'
     )
SELECT *
  FROM your_sql
 WHERE n = 1
 ORDER BY time_period
;

and now with the given available data:

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
     )
   , xrows AS (
          SELECT t.*
               , COUNT(*) OVER (PARTITION BY time_period) AS n
            FROM your_sql AS t
     )
SELECT * FROM xrows WHERE n = 1
 ORDER BY time_period
;

Result:

+-----+---------------------+-------+-------+---+
| row | time_period         | name  | value | n |
+-----+---------------------+-------+-------+---+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 | 1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 | 1 |
+-----+---------------------+-------+-------+---+

Here's the updated solution for the new requirement. I've added a duplicate row for row=3 (row=7), but only one of the rows will be shown. This case would have been removed in the previous COUNT logic:

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 7, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
     )
   , xrows0 AS (
          SELECT t.*
               , ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, value, row) AS n1
               , RANK()       OVER (PARTITION BY time_period ORDER BY name, value     ) AS n2
            FROM your_sql AS t
     )
   , xrows AS (
          SELECT t.*
               , MAX(n2) OVER (PARTITION BY time_period) AS m2
            FROM xrows0 AS t
     )
SELECT *
  FROM xrows
 WHERE m2 = 1
   AND n1 = 1
 ORDER BY time_period
;

Result:
+-----+---------------------+-------+-------+----+----+------+
| row | time_period         | name  | value | n1 | n2 | m2   |
+-----+---------------------+-------+-------+----+----+------+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 |  1 |  1 |    1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 |  1 |  1 |    1 |
+-----+---------------------+-------+-------+----+----+------+

and the new requirement, where only the name needs to be the same over that time_period, and with your new data row:

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1',  100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2',  105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1',  120 UNION ALL
        SELECT 8, '2021-07-01T00:05:00', 'Name1',  120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3',  500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1',  110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3',  450 UNION ALL
        SELECT 7, '2021-07-01T00:20:00', 'Name1', 1000
     )
   , xrows0 AS (
          SELECT t.*
               , ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, row  ) AS n1
               , RANK()       OVER (PARTITION BY time_period ORDER BY name, value) AS n2
            FROM your_sql AS t
     )
   , xrows AS (
          SELECT t.*
               , MAX(n2) OVER (PARTITION BY time_period) AS m2
            FROM xrows0 AS t
     )
SELECT *
  FROM xrows
 WHERE m2 = 1
   AND n1 = 1
 ORDER BY time_period
;

+-----+---------------------+-------+-------+----+----+------+
| row | time_period         | name  | value | n1 | n2 | m2   |
+-----+---------------------+-------+-------+----+----+------+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 |  1 |  1 |    1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 |  1 |  1 |    1 |
|   7 | 2021-07-01T00:20:00 | Name1 |  1000 |  1 |  1 |    1 |
+-----+---------------------+-------+-------+----+----+------+

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Try below

select * from (
  select * from (
    SELECT time_period, name, value
    FROM TBL.TBLValues,
      UNNEST(nested_name) as unnested_name
    WHERE time_period > '2021-07-01 00:00:00'
    AND name != "None"
  )
  where true
  qualify count(*) over(partition by time_period) = 1
)
where true
qualify row_number() over(partition by name order by time_period desc) = 1    

If applied to sample data in your question - output is

enter image description here

Upvotes: 0

Related Questions