Reputation: 163

How to perform self join on a large table

I have a very large table (~70 million rows) in which the id column is derived from another table. The ID has been wrongly assigned. 2 different people with different first names have been assigned the same id while they shouldn't have been assigned it. In theory I understand that a self join on the same table where the id is same and the first is not equal should work but with the sheer number of rows, the query is stuck. Any ideas as to how I can proceed?

I have tried to filter some data out and copied the contents to a whole new table but to no avail.

My code looks like thus

select distinct m1.first_name, m.first_name, m1.id from member m inner join member_1 m1 on m.id = m1.id and m.first_name <> m1.first_name

Data looks like this

run_id  first_name      id 
1       aaaa            1 
2       aaaa            1 
3       aaaa            1 
3       bbbb            2 
3       cccc            3 
4       aaaa            1 
4       bbbb            4 
5       bbbb            4 
5       cccc            3 
6       aaaa            1 
6       bbbb            4 
6       cccc            3

In this if you notice, in run_id 2, bbbb was assigned the id of 2 and in runs 5 and 6, bbbb was assigned an id of 4. In this case, bbbb appears to have been assigned 2 ids. How do I only filter out such rows and not pull aaaa and cccc. Using distinct is draining out my resources and doesnt appear to be a good option

Upvotes: 1

Answers (2)

Gordon Linoff

Reputation: 1269503

I would suggest using exists:

select m.*
from member_1 m
where exists (select 1
              from member_1 m2
              where m2.id = m.id and m2.first_name <> m.first_name
             );

Then, for performance, you want an index on member_1(id, first_name).

If you just want the ids, you can use aggregation:

select id
from member_1
group by id
having min(first_name) <> max(first_name);

Upvotes: 1

Zohar Peled

Reputation: 82474

Instead of a self join, use group by with having and count distinct:

SELECT first_name
FROM member 
GROUP BY first_name
HAVING COUNT(DISTINCT Id) > 1

This query will return a distinct list of first_name that has more than one id.

You can improve performance by adding an index on first_name and id if that is still running too slow.

Upvotes: 2

How to perform self join on a large table

Answers (2)

Related Questions