Reputation: 97
Given a table T with columns X and Y I want to get pairs for columns X and Y but once a value is used in a pair I don't want it to appear again. Example:
X Y
-----
A 1
A 2
A 3
B 1
B 2
B 3
C 4
D 5
D 6
E 5
E 6
I want the SQL to output:
X Y
-----
A 1
B 2
C 4
D 5
E 6
note that the value of 3 in Y needs to be ignored as there is no other matching value left for it on column X. The order needs to be preserved and the first matching pairs are valid.
Upvotes: 1
Views: 253
Reputation: 2020
This can be solved in pure SQL only with advanced features of specific DB (like Oracle). Also this task is a good example to demonstrate when specific SQL features really do matter.
Below I'm going to demonstrate various approaches using Oracle and Spark.
create table t(x, y) as
(select 'A', 1 from dual
union all select 'A', 2 from dual
union all select 'A', 3 from dual
union all select 'B', 1 from dual
union all select 'B', 2 from dual
union all select 'B', 3 from dual
union all select 'C', 4 from dual
union all select 'D', 5 from dual
union all select 'D', 6 from dual
union all select 'E', 5 from dual
union all select 'E', 6 from dual);
1. Oracle model clause
(Version 10gR1 and later)
SQL> select x, y from
2 (
3 select *
4 from t
5 model
6 dimension by (x, y)
7 measures (0 z)
8 (z[any, any] order by x, y =
9 decode(max(z)[x<=cv(x),cv(y)]+max(z)[cv(x),y<=cv(y)],0,1,0))
10 )
11 where z = 1;
X Y
- ----------
A 1
B 2
C 4
D 5
E 6
2. Oracle recursive with
(Version 11gR2 and later)
with t0 as(select t.*, row_number() over (order by x, y) rn from t)
, r(rn, x, y, z, c, flag) as
(
select t0.rn, t0.x, t0.y, 1, ku$_objnumset(t0.y), 1
from t0
where rn = 1
union all
select t0.rn, t0.x, t0.y
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then 1 else 0 end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then c multiset union ku$_objnumset(t0.y) else c end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c or (flag = 1 and r.x = t0.x) then 1 else 0 end
from t0
join r on r.rn + 1 = t0.rn
)
select x, y
from r
where z = 1
order by 1;
3. Oracle recursive with + cross apply
(Version 12cR1 and later)
with
t0 as
(select t.*, dense_rank() over(order by x) rnk from t),
r(rnk, x, y, c) as
(select 1, t0.x, min(t0.y), numbers(min(t0.y))
from t0
where rnk = 1
group by t0.rnk, t0.x
union all
select r.rnk + 1, t0.x, t0.y, decode(t0.y, null, r.c, r.c multiset union numbers(t0.y))
from r
cross apply (select min(t0.x) x, min(t0.y) y
from t0
where r.rnk + 1 = t0.rnk
and t0.y not member of r.c) t0
where rnk < (select max(rnk) from t0)
) cycle x set dummy to 1 default 0
select x, y from r where y is not null order by 1;
Oracle has some issues with detecting cycles when cross apply is used in recursive member so dummy column was added to avoid an exception (even though there is no actual cycle).
In all above approaches y
can be selected only if it has not been selected for previous X
s. If, however, the goal is to select next min y
greater than current y
, then below will also work.
4. Oracle recursive with and no collections to track allocated Y
s
with
t0 as
(select t.*, dense_rank() over(order by x) rnk from t),
r(rnk,y) as
(
select 1, min(t0.y)
from t0
where rnk = 1
union all
select r.rnk + 1,
nvl((select min(t0.y) y
from t0
where r.rnk + 1 = t0.rnk
and t0.y > r.y), r.y)
from r
where rnk < (select max(rnk) from t0)
)
select x, y
from t0
join (select y, min(rnk) rnk from r group by y) r using (rnk, y)
order by 1;
5. Oracle connect by
clause (limit on concatenated string is 4000 chars)
select regexp_substr(m, '~([^#]+)#\d+', 1, level, null, 1) x,
regexp_substr(m, '~[^#]+#(\d+)', 1, level, null, 1) y
from (select min(p) keep(dense_rank first order by l desc, rn) m
from (select rownum rn,
level l,
sys_connect_by_path(x || '#' || y, '~') p
from t
start with (x, y) in (select min(x), min(y) from t)
connect by prior x < x
and prior y < y
and rownum <= (select count(distinct x) from t)))
connect by regexp_substr(m, '~[^#]+#\d', 1, level) is not null;
And finally Spark + pure functional Scala! (not Spark SQL though)
val q = """
|select "A" x, 1 y
|union all select "A", 2
|union all select "A", 3
|union all select "B", 1
|union all select "B", 2
|union all select "B", 3
|union all select "C", 4
|union all select "D", 5
|union all select "D", 6
|union all select "E", 5
|union all select "E", 6
"""
val df = spark.sql(q)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.DataFrame
import scala.annotation.tailrec
Transformation for a data frame.
def filterRows(df: DataFrame) = {
val schema = StructType(df.schema.fields :+ StructField("flag", IntegerType))
implicit val encoder = RowEncoder(schema)
def getFlag(chk: Boolean, y: Int, a: List[Int]): Integer =
if (chk && a.forall(_ != y)) 1 else null
def updAllocated(f: Integer, y: Int, l: List[Int]) = if (f == 1) y :: l else l
@tailrec def doIt_(iter: Iterator[Row], prevX: String, doCheck: Boolean,
allocated: List[Int], res: Iterator[Row]): Iterator[Row] = {
if (iter.hasNext) {
val r = iter.next
val curX: String = r.getAs("x")
val curY: Int = r.getAs("y")
val doCheck_ = doCheck || curX != prevX
val flag = getFlag(doCheck_, curY, allocated)
doIt_(iter, curX, doCheck_ && flag != 1, updAllocated(flag, curY, allocated),
res ++ Iterator(Row.fromSeq(r.toSeq :+ flag)))
} else res
}
def doIt(iter: Iterator[Row]): Iterator[Row] =
doIt_(iter, "", true, List[Int](), Iterator[Row]())
df.repartition(1).sortWithinPartitions($"x", $"y").mapPartitions(doIt).
filter("flag is not null").select($"x", $"y")
}
Test in Spark shell
scala> filterRows(df).show
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
| C| 4|
| D| 5|
| E| 6|
+---+---+
PS. Even most straightforward modification of CTE (option 4 adapted to MSSQL dialect) does not work in MSSQL 2017. Because group functions are not allowed in scalar subqueries in recursive member.
GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'r'.
After replacing min(t0.y)
with top 1 (t0.y)
it throws below exception
The TOP or OFFSET operator is not allowed in the recursive part of a recursive common table expression 'r'.
Option 2, however, can be adapted to MSSQL if we track allocated Y
s in concatenated string instead of a collection.
Upvotes: 2