Reputation: 591
I'm working with SQL and was wondering how I would get all of the rows where values in 2 columns are equal. For example, imagine this table:
+----+---------+
| ID | Version |
+----+---------+
| AB | 1 |
| AB | 1 |
| BA | 2 |
| BA | 2 |
| CB | 1 |
+----+---------+
I want to select all rows where the IDs and versions match other rows with the same values in their ID and Version columns. In other words, I want to find duplicate values. So the desired output would be:
+----+---------+
| ID | Version |
+----+---------+
| AB | 1 |
| AB | 1 |
| BA | 2 |
| BA | 2 |
+----+---------+
How would I go about doing this as efficiently as possible in a table with over a million rows?
Upvotes: 1
Views: 3014
Reputation: 6751
If you need a duplicates count per group, then use GROUP BY ... HAVING
. If you need an overall count of duplicated rows, then use another aggregation over the group by.
For Oracle (fiddle) it can be done in single step:
with a as (
select 1 as id, 'A' as v from dual union all
select 1 as id, 'A' as v from dual union all
select 1 as id, 'B' as v from dual union all
select 1 as id, 'B' as v from dual union all
select 2 as id, 'C' as v from dual
)
select sum(count(1)) as total_duplicates
, count(count(1)) as duplicate_groups
from a
group by id, v
having count(1) > 1
For SQL Server (fiddle), for example, it does not work, so I've added another select over it
with a as (
select 1 as id, 'A' as v union all
select 1 as id, 'A' as v union all
select 1 as id, 'B' as v union all
select 1 as id, 'B' as v union all
select 2 as id, 'C' as v
)
select sum(cnt) as total_duplicates
, count(cnt) as duplicate_groups
from (
select count(1) as cnt
from a
group by id, v
having count(1) > 1
) as q
| total_duplicates | duplicate_groups |
+-------------------------------------+
| 4 | 2 |
Upvotes: 1
Reputation: 1269453
The simplest method are probably window functions:
select t.*
from (select t.*,
count(*) over (partition by id, version) as cnt
from t
) t
where cnt >= 2;
If you have an index on (id, version)
(or (version, id)
), then the database engine should be able to take advantage of that.
Upvotes: 3