TheLegendaryCopyCoder
TheLegendaryCopyCoder

Reputation: 1842

SQL - How to select A&B results excluding B&A redundent results from same table

You have a table containing a userID column and 2 function columns. The table has 100 million rows and 10's of thousands of users.

enter image description here

The problem is that each user has their function1 and function2 duplicated but in the opposite function order. I.E. User=1000 has F1=5, F2=10 but also F2=10, F2=5. So if you can filter off these duplicates you can save about 50% of the space this table is using.

What is the fastest way to filter out the duplicate functions for each user.

Upvotes: 0

Views: 1383

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1270713

First, @TriV's answer is correct, as I read the question. I have no idea why it is downvoted or deleted.

Second, if you want to remove the rows and you know all are duplicated, then you can do:

delete from t
    where function1 > function2;

That is not satisfying, because you want the fastest method. Deletion can be expensive, so it might be faster to:

select *
into temp_t
from t
where function1 < function2;

truncate table t;

insert into t
    select *
    from temp_t;

If you don't have full duplicates, then you can do the same idea with something like:

select *
into temp_t
from t
where function1 < function2
union all
select *
from t t
where function1 > function2 and
      not exists (select 1 from t t2 where t2.function1 = t.function2 and t2.function2 = t.function1);

The latter expression is probably the fastest way to get the unique set, assuming you have an index on t(function1, function2).

Upvotes: 1

daZza
daZza

Reputation: 1689

To delete all duplicates you could proceed like this:

Add a new column temp_pk and update it with unique values (rownum for example or you can use a sequence as well). Once the column is filled, define the PK over it.

Then you should be able to use this query to delete duplicate data:

DELETE FROM table
WHERE temp_pk IN
(
    SELECT Max(temp_pk) AS pkToDelete
    FROM
        (
            SELECT temp_pk, A, B
            FROM table
            WHERE A<=B
        UNION ALL
            SELECT temp_pk, B, A
            FROM table
            WHERE A>B
        ) AS u
    GROUP BY A, B
    HAVING COUNT(*) > 1
)

After that drop the the PK and the temp_ok column and redefine a PK over UserID, so this doesn't happen again the future.

A/B = your Function1/Function2 columns

Upvotes: 1

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49270

You can use row_number to get one such combination for a user in case (a,b) , (b,a) pairs exist.

select userid,function1,function2 from (
select userid,function1,function2
,row_number() over(partition by userid
                   order by 
                   case when function1<function2 then function1 else function2 end,
                   case when function1>function2 then function1 else function2 end,
                   function1
                  ) as rnum
from tablename
) t
where rnum=1

Use a cte to delete one of the symmetric pair combinations.

with cte as (select userid,function1,function2
             ,row_number() over(partition by userid
                   order by 
                   case when function1<function2 then function1 else function2 end,
                   case when function1>function2 then function1 else function2 end,
                   function1
                  ) as rnum
            from tablename)
delete from cte where rnum > 1

Upvotes: 2

Related Questions