Reputation: 163
I have a very large table (~70 million rows) in which the id column is derived from another table. The ID has been wrongly assigned. 2 different people with different first names have been assigned the same id while they shouldn't have been assigned it. In theory I understand that a self join on the same table where the id is same and the first is not equal should work but with the sheer number of rows, the query is stuck. Any ideas as to how I can proceed?
I have tried to filter some data out and copied the contents to a whole new table but to no avail.
My code looks like thus
select distinct m1.first_name, m.first_name, m1.id from member m inner join member_1 m1 on m.id = m1.id and m.first_name <> m1.first_name
Data looks like this
run_id first_name id
1 aaaa 1
2 aaaa 1
3 aaaa 1
3 bbbb 2
3 cccc 3
4 aaaa 1
4 bbbb 4
5 bbbb 4
5 cccc 3
6 aaaa 1
6 bbbb 4
6 cccc 3
In this if you notice, in run_id 2, bbbb was assigned the id of 2 and in runs 5 and 6, bbbb was assigned an id of 4. In this case, bbbb appears to have been assigned 2 ids. How do I only filter out such rows and not pull aaaa and cccc. Using distinct is draining out my resources and doesnt appear to be a good option
Upvotes: 1
Views: 913
Reputation: 1269503
I would suggest using exists
:
select m.*
from member_1 m
where exists (select 1
from member_1 m2
where m2.id = m.id and m2.first_name <> m.first_name
);
Then, for performance, you want an index on member_1(id, first_name)
.
If you just want the ids, you can use aggregation:
select id
from member_1
group by id
having min(first_name) <> max(first_name);
Upvotes: 1
Reputation: 82474
Instead of a self join, use group by
with having
and count distinct
:
SELECT first_name
FROM member
GROUP BY first_name
HAVING COUNT(DISTINCT Id) > 1
This query will return a distinct list of first_name
that has more than one id
.
You can improve performance by adding an index on first_name
and id
if that is still running too slow.
Upvotes: 2