fitifiti

Reputation: 97

SQL to pair columns exactly once

Given a table T with columns X and Y, I want to get pairs of X and Y values, but once a value is used in a pair I don't want it to appear again. Example:

X   Y
-----
A   1
A   2
A   3
B   1
B   2
B   3
C   4
D   5
D   6
E   5
E   6

I want the SQL to output:

X   Y
-----
A   1
B   2
C   4
D   5
E   6

Note that the value 3 in Y needs to be ignored, as there is no matching value left for it in column X. The order needs to be preserved, and the first matching pairs are the valid ones.

Upvotes: 1

Views: 253

Answers (1)

Dr Y Wit

Reputation: 2020

This can be solved in pure SQL only with the advanced features of a specific DB (like Oracle). The task is also a good example of when specific SQL features really do matter.

Below I'm going to demonstrate various approaches using Oracle and Spark.

create table t(x, y) as
(select 'A',   1 from dual
union all select 'A',   2 from dual
union all select 'A',   3 from dual
union all select 'B',   1 from dual
union all select 'B',   2 from dual
union all select 'B',   3 from dual
union all select 'C',   4 from dual
union all select 'D',   5 from dual
union all select 'D',   6 from dual
union all select 'E',   5 from dual
union all select 'E',   6 from dual);

1. Oracle model clause (Version 10gR1 and later)

SQL> select x, y from
  2  (
  3    select *
  4    from t
  5    model
  6    dimension by (x, y)
  7    measures (0 z)
  8    (z[any, any] order by x, y =
  9     decode(max(z)[x<=cv(x),cv(y)]+max(z)[cv(x),y<=cv(y)],0,1,0))
 10  )
 11  where z = 1;

X          Y
- ----------
A          1
B          2
C          4
D          5
E          6

2. Oracle recursive with (Version 11gR2 and later)

with t0 as(select t.*, row_number() over (order by x, y) rn from t)
, r(rn, x, y, z, c, flag) as
(
select t0.rn, t0.x, t0.y, 1, ku$_objnumset(t0.y), 1
from t0
where rn = 1
union all
select t0.rn, t0.x, t0.y
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then 1 else 0 end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c then c multiset union ku$_objnumset(t0.y) else c end
, case when (r.x <> t0.x or flag = 0) and t0.y not member of c or (flag = 1 and r.x = t0.x) then 1 else 0 end
from t0
join r on r.rn + 1 = t0.rn
)
select x, y
from r
where z = 1
order by 1;

Here ku$_objnumset is an internal Oracle-supplied collection type (a nested table of numbers from the metadata/Data Pump packages), used so that no custom type has to be created.

3. Oracle recursive with + cross apply (Version 12cR1 and later)
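The numbers collection type used below is not defined in the answer; my assumption is that it is a plain nested table of numbers, something like:

create or replace type numbers as table of number;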

with
t0 as
 (select t.*, dense_rank() over(order by x) rnk from t),
r(rnk, x, y, c) as
 (select 1, t0.x, min(t0.y), numbers(min(t0.y))
    from t0
   where rnk = 1
   group by t0.rnk, t0.x
  union all
  select r.rnk + 1, t0.x, t0.y, decode(t0.y, null, r.c, r.c multiset union numbers(t0.y))
    from r
   cross apply (select min(t0.x) x, min(t0.y) y
                  from t0
                 where r.rnk + 1 = t0.rnk
                   and t0.y not member of r.c) t0
   where rnk < (select max(rnk) from t0)
 ) cycle x set dummy to 1 default 0
select x, y from r where y is not null order by 1;

Oracle has some issues with detecting cycles when cross apply is used in the recursive member, so a dummy column was added to avoid an exception (even though there is no actual cycle).

In all of the above approaches, a y can be selected only if it has not been allocated to a previous X. If, however, the goal is to select the next minimal y greater than the current y, then the query below will also work.
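To illustrate the difference, consider this data set (my own example, not part of the question):

create table t2(x, y) as
(select 'A', 2 from dual
union all select 'B', 1 from dual
union all select 'B', 3 from dual);

Approaches 1-3 pair B with 1 (since 1 has never been allocated), while option 4 below pairs B with 3 (the smallest y greater than the current y = 2).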

4. Oracle recursive with and no collections to track allocated Ys

with
t0 as
 (select t.*, dense_rank() over(order by x) rnk from t),
r(rnk,y) as
 (
 select 1, min(t0.y)
   from t0
  where rnk = 1
 union all
 select r.rnk + 1,
        nvl((select min(t0.y) y
              from t0
             where r.rnk + 1 = t0.rnk
               and t0.y > r.y), r.y)
   from r
  where rnk < (select max(rnk) from t0)
 )
select x, y
from t0
join (select y, min(rnk) rnk from r group by y) r using (rnk, y) 
order by 1;

5. Oracle connect by clause (limit on concatenated string is 4000 chars)

select regexp_substr(m, '~([^#]+)#\d+', 1, level, null, 1) x,
       regexp_substr(m, '~[^#]+#(\d+)', 1, level, null, 1) y
  from (select min(p) keep(dense_rank first order by l desc, rn) m
          from (select rownum rn,
                       level l,
                       sys_connect_by_path(x || '#' || y, '~') p
                  from t
                 start with (x, y) in (select min(x), min(y) from t)
                connect by prior x < x
                       and prior y < y
                       and rownum <= (select count(distinct x) from t)))
connect by regexp_substr(m, '~[^#]+#\d', 1, level) is not null;

The inner query builds every chain of pairs with strictly increasing x and y via connect by, keeps the longest such path (and, on ties, the first one encountered), and the outer connect by splits the concatenated path back into rows with regexp_substr.

And finally Spark + pure functional Scala! (not Spark SQL though)

val q = """
          |select "A" x, 1 y
          |union all select "A", 2
          |union all select "A", 3
          |union all select "B", 1
          |union all select "B", 2
          |union all select "B", 3
          |union all select "C", 4
          |union all select "D", 5
          |union all select "D", 6
          |union all select "E", 5
          |union all select "E", 6
        """
val df = spark.sql(q)

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.DataFrame
import spark.implicits._ // for the $"col" syntax; pre-imported in spark-shell
import scala.annotation.tailrec

The transformation for a data frame:

def filterRows(df: DataFrame) = {
  // Output schema: the original columns plus a nullable flag column.
  val schema = StructType(df.schema.fields :+ StructField("flag", IntegerType))
  implicit val encoder = RowEncoder(schema)

  // Flag is 1 when checking is enabled and y has not been allocated yet.
  def getFlag(chk: Boolean, y: Int, a: List[Int]): Integer =
    if (chk && a.forall(_ != y)) 1 else null
  def updAllocated(f: Integer, y: Int, l: List[Int]) = if (f == 1) y :: l else l

  // Single pass over the sorted rows, carrying the previous x, whether the
  // current x still needs a pair, and the list of already allocated ys.
  @tailrec def doIt_(iter: Iterator[Row], prevX: String, doCheck: Boolean,
                     allocated: List[Int], res: Iterator[Row]): Iterator[Row] = {
    if (iter.hasNext) {
      val r = iter.next
      val curX: String = r.getAs("x")
      val curY: Int = r.getAs("y")
      val doCheck_ = doCheck || curX != prevX
      val flag = getFlag(doCheck_, curY, allocated)
      doIt_(iter, curX, doCheck_ && flag != 1, updAllocated(flag, curY, allocated),
        res ++ Iterator(Row.fromSeq(r.toSeq :+ flag)))
    } else res
  }

  def doIt(iter: Iterator[Row]): Iterator[Row] =
    doIt_(iter, "", true, List[Int](), Iterator[Row]())

  // Everything must land in one partition for the sequential scan to be valid.
  df.repartition(1).sortWithinPartitions($"x", $"y").mapPartitions(doIt).
    filter("flag is not null").select($"x", $"y")
}

Test in the Spark shell:

scala> filterRows(df).show
+---+---+
|  x|  y|
+---+---+
|  A|  1|
|  B|  2|
|  C|  4|
|  D|  5|
|  E|  6|
+---+---+

PS. Even the most straightforward modification of the CTE (option 4 adapted to the MSSQL dialect) does not work in MSSQL 2017, because group functions are not allowed in scalar subqueries in the recursive member:

GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'r'.

After replacing min(t0.y) with top 1 (t0.y) it throws the exception below:

The TOP or OFFSET operator is not allowed in the recursive part of a recursive common table expression 'r'.

Option 2, however, can be adapted to MSSQL if we track the allocated Ys in a concatenated string instead of a collection, as sketched below.
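A sketch of that adaptation (my translation, not from the original answer; it tracks the allocated Ys in a '#'-delimited varchar(4000), so the delimiter and the length limit are my choices):

with t0 as (select t.*, row_number() over (order by x, y) rn from t)
, r(rn, x, y, z, c, flag) as
(
-- anchor: the first row is always selected
select rn, x, y, 1, cast('#' + cast(y as varchar(10)) + '#' as varchar(4000)), 1
from t0
where rn = 1
union all
-- recursive member: the same selection logic as option 2,
-- with "not member of" replaced by a "not like" check on the string
select t0.rn, t0.x, t0.y
, case when (r.x <> t0.x or r.flag = 0)
        and r.c not like '%#' + cast(t0.y as varchar(10)) + '#%' then 1 else 0 end
, cast(case when (r.x <> t0.x or r.flag = 0)
        and r.c not like '%#' + cast(t0.y as varchar(10)) + '#%'
       then r.c + cast(t0.y as varchar(10)) + '#' else r.c end as varchar(4000))
, case when (r.x <> t0.x or r.flag = 0)
        and r.c not like '%#' + cast(t0.y as varchar(10)) + '#%'
       or (r.flag = 1 and r.x = t0.x) then 1 else 0 end
from t0
join r on r.rn + 1 = t0.rn
)
select x, y from r where z = 1 order by x;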

Upvotes: 2
