cgnorthcutt
cgnorthcutt

Reputation: 4036

Google Bigquery: Remove rows where column value is distinct (count = 1) without referencing table

Given a table, I wish to remove rows where a column value is distinct.

So if we want to do this for column2 on matrix A

     c1 c2 c3
A = |1  2  4 |
    |1  2  5 |
    |1  1  6 |

yields

     c1 c2 c3
A = |1  2  4 |
    |1  2  5 |

This can be done easily by

SELECT * FROM Table WHERE c2 IN
(SELECT c2 FROM Table GROUP BY c2 HAVING COUNT(*) > 1)

Unfortunately, in the middle of a subquery, you don't have the data stored in a Table and I don't want to create a view as I need to do this filtering all in one query.

Any ideas on how I can still filter out distinct rows with respect to a single column without referencing a Table in the subquery?

The solution should be of the form:

SELECT <something goes here>
FROM <the subquery which outputs A goes here>
<anything you want here that is legal Bigquery - e.g. can't reference A>

and there is no table to reference.

Upvotes: 1

Views: 1277

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269633

Bigquery supports window functions, so you might be able to do this:

select t.*
from (select t.*, count(*) over (partition by col2) as cnt
      from table t
     ) t
where cnt >= 2;

This still references the table in a subquery, but that is the only reference to the table.

Upvotes: 4

Related Questions