mysqlsqlpostgresqlaggregate-functionsrelational-division

Reputation: 87

Find pair of students who take exactly the same classes

I have to find a pair of students who take exactly the same classes from table that has studentID and courseID.

studentID | courseID
    1           1
    1           2
    1           3
    2           1
    3           1
    3           2
    3           3

Query should return (1, 3).
The result also should not have duplicate rows such as (1,3) and (3,1).

Upvotes: 2

Answers (5)

Priyabrata Patnaik

Reputation: 11

Process to get this done in mysql

Create table student_course_agg 
( 
student_id int,
courses varchar(150)
);

INSERT INTO student_course_agg
select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1;

SELECT master.student_id m_student_id,child.student_id c_student_id
FROM student_course_agg master 
JOIN student_course_ag child 
    ON master.student_id<child.student_id and master.courses=child.courses;

Direct query.

SELECT master.student_id m_student_id,child.student_id c_student_id
FROM (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1) master
JOIN (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1) child 
   ON master.studentID <child.studentID and master.courses=child.courses;

Upvotes: 0

wildplasser

Reputation: 44250

Naive relational division implementation, with CTE:

WITH pairs AS (
        SELECT DISTINCT a.student_id AS aaa
        , b.student_id AS bbb
        FROM student_course a
        JOIN student_course b ON a.course_id = b.course_id
        )
SELECT *
FROM pairs p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.aaa
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.bbb
                AND nx2.course_id = nx1.course_id
                )
        )
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.bbb
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.aaa
                AND nx2.course_id = nx1.course_id
                )
        )
        ;

The same, without CTE's:

SELECT *
FROM (
        SELECT DISTINCT a.student_id AS aaa
        , b.student_id AS bbb
        FROM student_course a
        JOIN student_course b ON a.course_id = b.course_id
        ) p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.aaa
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.bbb
                AND nx2.course_id = nx1.course_id
                )
        )
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.bbb
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.aaa
                AND nx2.course_id = nx1.course_id
                )
        )
        ;

The non-CTE version is faster, obviously.

Upvotes: 1

Erwin Brandstetter

Reputation: 656754

Test case:

I created a somewhat realistic test case:

CREATE TEMP TABLE student_course (
   student_id integer
  ,course_id integer
  ,PRIMARY KEY (student_id, course_id)
);

INSERT INTO student_course
SELECT *
FROM (VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)) v
      -- to include some non-random values in test
UNION  ALL
SELECT DISTINCT student_id, normal_rand((random() * 30)::int, 1000, 35)::int
FROM   generate_series(4, 5000) AS student_id;

DELETE FROM student_course WHERE random() > 0.9; -- create some dead tuples
ANALYZE student_course; -- needed for temp table

Note the use of normal_rand() to populate the dummy table with a normal distribution of values. It's shipped with the tablefunc module, and since i am going to use that further down anyway ...

Also note the bold emphasis on the numbers I am going to manipulate for the benchmark to simulate various test cases.

Plain SQL

The question is rather basic and unclear. Find the first two students with matching courses? Or find all? Find couples of them or groups of students sharing the same courses? Craig answers to:
Find all couples sharing the same courses.

C1 - Craig's first query

Plain SQL With a CTE and grouping by arrays, slightly formatted:

WITH student_coursearray(student_id, courses) AS (
   SELECT student_id, array_agg(course_id ORDER BY course_id)
   FROM   student_course
   GROUP  BY student_id
   )
SELECT a.student_id, b.student_id
FROM   student_coursearray a
JOIN   student_coursearray b ON (a.courses = b.courses)
WHERE  a.student_id < b.student_id
ORDER  BY a.student_id, b.student_id;

The second query in Craig's answer dropped out of the race right away. With more than just a few rows, performance quickly deteriorates badly. The CROSS JOIN is poison.

E1 - Improved version

There is one major weakness, ORDER BY per aggregate is a bad performer, so I rewrote with ORDER BY in a subquery:

WITH cte AS (
   SELECT student_id, array_agg(course_id) AS courses
   FROM  (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
   GROUP  BY student_id
   )
SELECT a.student_id, b.student_id
FROM   cte a
JOIN   cte b USING (courses)
WHERE  a.student_id < b.student_id
ORDER  BY 1,2;

E2 - Alternative interpretation of question

I think the generally more useful case is:
Find all students sharing the same courses.
So I return arrays of students with matching courses.

WITH s AS (
   SELECT student_id, array_agg(course_id) AS courses
   FROM  (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
   GROUP  BY student_id
   )
SELECT array_agg(student_id)
FROM   s
GROUP  BY courses
HAVING count(*) > 1
ORDER    BY array_agg(student_id);

F1 - Dynamic PL/pgSQL function

To make this generic and fast I wrapped it into a plpgsql function with dynamic SQL:

CREATE OR REPLACE FUNCTION f_same_set(_tbl regclass, _id text, _match_id text)
  RETURNS SETOF int[] AS
$func$
BEGIN

RETURN QUERY EXECUTE format(
   $f$
   WITH s AS (
      SELECT %1$I AS id, array_agg(%2$I) AS courses
      FROM   (SELECT %1$I, %2$I FROM %3$s ORDER BY 1, 2) s
      GROUP  BY 1
      )
   SELECT array_agg(id)
   FROM   s
   GROUP  BY courses
   HAVING count(*) > 1
   ORDER    BY array_agg(id)
   $f$
   ,_id, _match_id, _tbl
   );
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_same_set('student_course', 'student_id', 'course_id');

Works for any table with numeric columns. It's trivial to extend for other data types, too.

`crosstab()`

For a relatively small number of courses (and arbitrarily big number of students) crosstab() provided by the additional tablefunc module is another option in PostgreSQL. More general info here:
PostgreSQL Crosstab Query

Simple case

A simple case for the simple example in the question, much like explained in the linked answer:

SELECT array_agg(student_id)
FROM   crosstab('
     SELECT student_id, course_id, TRUE
     FROM   student_course
     ORDER  BY 1'

   ,'VALUES (1),(2),(3)'
   )
AS t(student_id int, c1 bool, c2 bool, c3 bool)
GROUP  BY c1, c2, c3
HAVING count(*) > 1;

F2 - Dynamic crosstab function

For the simple case, the crosstab variant was faster, so I build a plpgsql function with dynamic SQL and included it in the test. Functionally identical with F1.

CREATE OR REPLACE FUNCTION f_same_set_x(_tbl regclass, _id text, _match_id text)
  RETURNS SETOF int[] AS
$func$
DECLARE
   _ids int[];   -- for array of match_ids (course_id in example)
BEGIN

-- Get list of match_ids
EXECUTE format(
   'SELECT array_agg(DISTINCT %1$I ORDER BY %1$I) FROM %2$s',_match_id, _tbl)
INTO _ids;

-- Main query
RETURN QUERY EXECUTE format(
   $f$
   SELECT array_agg(%1$I)
   FROM   crosstab('SELECT %1$I, %2$I, TRUE FROM %3$s ORDER BY 1'
                  ,'VALUES (%4$s)')
      AS t(student_id int, c%5$s  bool)
   GROUP  BY c%6$s
   HAVING count(*) > 1
   ORDER    BY array_agg(student_id)
   $f$
   ,_id
   ,_match_id
   ,_tbl
   ,array_to_string(_ids, '),(')     -- values
   ,array_to_string(_ids, ' bool,c') -- column def list
   ,array_to_string(_ids, ',c')      -- names
   );
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_same_set_x('student_course', 'student_id', 'course_id');

Benchmark

I tested on my small PostgreSQL test server. PostgreSQL 9.1.9 on Debian Linux on an ~ 6 years old AMD Opteron Server. I ran 5 test sets with the above settings and each of the presented queries. Best of 5 with EXPLAIN ANALYZE.

I used these values for the bold numbers in the above test case to populate:

nr. of students / max. nr. of courses / standard deviation (results in more distinct course_ids)
1. 1000 / 30 / 35
2. 5000 / 30 / 50
3. 10000 / 30 / 100
4. 10000 / 10 / 10
5. 10000 / 5 / 5

C1
1. Total runtime: 57 ms
2. Total runtime: 315 ms
3. Total runtime: 663 ms
4. Total runtime: 543 ms
5. Total runtime: 2345 ms (!) - deteriorates with many pairs

E1
1. Total runtime: 46 ms
2. Total runtime: 251 ms
3. Total runtime: 529 ms
4. Total runtime: 338 ms
5. Total runtime: 734 ms

E2
1. Total runtime: 45 ms
2. Total runtime: 245 ms
3. Total runtime: 515 ms
4. Total runtime: 218 ms
5. Total runtime: 143 ms

F1 victor
1. Total runtime: 14 ms
2. Total runtime: 77 ms
3. Total runtime: 166 ms
4. Total runtime: 80 ms
5. Total runtime: 54 ms

F2
1. Total runtime: 62 ms
2. Total runtime: 336 ms
3. Total runtime: 1053 ms (!) crosstab() deteriorates with many distinct values
4. Total runtime: 195 ms
5. Total runtime: 105 ms (!) but performs well with fewer distinct values

The PL/pgSQL function with dynamic SQL, sorting rows in a subquery is clear victor.

Upvotes: 1

Craig Ringer

Reputation: 324465

Given sample data:

CREATE TABLE student_course (
   student_id integer,
   course_id integer,
   PRIMARY KEY (student_id, course_id)
);

INSERT INTO student_course (student_id, course_id)
VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;

Use array aggregation

One option is to use a CTE to join on the ordered lists of courses each student is taking:

WITH student_coursearray(student_id, courses) AS (
  SELECT student_id, array_agg(course_id ORDER BY course_id)
  FROM student_course
  GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id > b.student_id;

array_agg is actually part of the SQL standard, as is the WITH common-table expression syntax. Neither are supported by MySQL so you'll have to express this a different way if you want to support MySQL.

Find missing course pairings per-student

Another way to think about this would be "for every student pairing, find out if one is taking a class the other is not". This would lend its self to a FULL OUTER JOIN, but it's pretty awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the set of classes each takes. If there are any null rows then one took a class the other didn't, so you can use that with a NOT EXISTS filter to exclude such pairings. That gives you this monster:

WITH student_id_pairs(left_student, right_student) AS (
  SELECT DISTINCT a.student_id, b.student_id
  FROM student_course a 
  INNER JOIN student_course b ON (a.student_id > b.student_id)
)
SELECT left_student, right_student 
FROM student_id_pairs 
WHERE NOT EXISTS (
  SELECT 1
  FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
  FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
    ON (a.course_id = b.course_id)
  WHERE a.course_id IS NULL or b.course_id IS NULL
);

The CTE is optional and may be replaced by a CREATE TEMPORARY TABLE AS SELECT ... or whatever if your DB doesn't support CTEs.

Which to use?

I'm very confident that the array approach will perform better in all cases, particularly because for a really large data set you can take the WITH expression, create a temporary table from the query instead, add an index on (courses, student_id) to it and do crazy-fast equality searching that'll well and truly pay off the cost of the index creation time. You can't do that with the subquery joins approach.

Upvotes: 9

Praveen Prasannan

Reputation: 7123

select courses,group_concat(studentID) from
(select studentID, 
group_concat(courseID order by courseID) as courses
from Table1 group by studentID) abc
group by courses having courses like('%,%');

fiddle