Reputation: 1842
You have a table containing a userID column and 2 function columns. The table has 100 million rows and 10's of thousands of users.
The problem is that each user has their function1 and function2 duplicated but in the opposite function order. I.E. User=1000 has F1=5, F2=10 but also F2=10, F2=5. So if you can filter off these duplicates you can save about 50% of the space this table is using.
What is the fastest way to filter out the duplicate functions for each user.
Upvotes: 0
Views: 1383
Reputation: 1270713
First, @TriV's answer is correct, as I read the question. I have no idea why it is downvoted or deleted.
Second, if you want to remove the rows and you know all are duplicated, then you can do:
delete from t
where function1 > function2;
That is not satisfying, because you want the fastest method. Deletion can be expensive, so it might be faster to:
select *
into temp_t
from t
where function1 < function2;
truncate table t;
insert into t
select *
from temp_t;
If you don't have full duplicates, then you can do the same idea with something like:
select *
into temp_t
from t
where function1 < function2
union all
select *
from t t
where function1 > function2 and
not exists (select 1 from t t2 where t2.function1 = t.function2 and t2.function2 = t.function1);
The latter expression is probably the fastest way to get the unique set, assuming you have an index on t(function1, function2)
.
Upvotes: 1
Reputation: 1689
To delete all duplicates you could proceed like this:
Add a new column temp_pk
and update it with unique values (rownum for example or you can use a sequence as well). Once the column is filled, define the PK over it.
Then you should be able to use this query to delete duplicate data:
DELETE FROM table
WHERE temp_pk IN
(
SELECT Max(temp_pk) AS pkToDelete
FROM
(
SELECT temp_pk, A, B
FROM table
WHERE A<=B
UNION ALL
SELECT temp_pk, B, A
FROM table
WHERE A>B
) AS u
GROUP BY A, B
HAVING COUNT(*) > 1
)
After that drop the the PK and the temp_ok column and redefine a PK over UserID, so this doesn't happen again the future.
A/B = your Function1/Function2 columns
Upvotes: 1
Reputation: 49270
You can use row_number
to get one such combination for a user in case (a,b) , (b,a) pairs exist.
select userid,function1,function2 from (
select userid,function1,function2
,row_number() over(partition by userid
order by
case when function1<function2 then function1 else function2 end,
case when function1>function2 then function1 else function2 end,
function1
) as rnum
from tablename
) t
where rnum=1
Use a cte to delete one of the symmetric pair combinations.
with cte as (select userid,function1,function2
,row_number() over(partition by userid
order by
case when function1<function2 then function1 else function2 end,
case when function1>function2 then function1 else function2 end,
function1
) as rnum
from tablename)
delete from cte where rnum > 1
Upvotes: 2