Reputation: 1862
I have a single table holding DNA variants for different people. I want to show the variants that are unique to a person:
Table DNA (engine ordered by variant):
person | variant
p1 | v1
p1 | v2
p1 | v3
p2 | v2
p2 | v3
p3 | v2
p3 | v3
p4 | v2
p4 | v3
So a simple query:
select variant from DNA where person = 'p1' and variant
not in (select variant from DNA where person in ('p2', 'p3'))
will return all variants unique to p1 vs. p2 and p3 (p4 not considered for this query). However - it is slow and runs out of memory.
Should I be doing this a different way?
Upvotes: 1
Views: 1995
Reputation: 101
I suspect that the reason it is running out of memory is that the select variant from DNA where person in ('p2', 'p3')
sub-query will result in v2, v3, v2, v3
. This, especially when brought to scale, seems exceedingly inefficient because of the repetition. Potentially, adding distinct
to the query may help, but in general this seems like an inefficient method of achieving your results if you have a lot of people (you'd have to manually type a lot of people in where person in (.........)
.
An alternative to this is to do a self join and basically limit the results to those where the only match is itself. Something like:
SELECT person, COUNT(*)
FROM (
SELECT * FROM table
ALL LEFT JOIN table
USING variant
)
GROUP BY person
HAVING COUNT(*) == 1;
Upvotes: 2